Preface


This book is about convex optimization, a special class of mathematical optimization
problems, which includes least-squares and linear programming problems. It
is  well  known  that  least-squares  and  linear  programming  problems  have  a  fairly 
complete  theory,  arise  in  a  variety  of  applications,  and  can  be  solved  numerically 
very  efficiently.  The  basic  point  of  this  book  is  that  the  same  can  be  said  for  the 
larger  class  of  convex  optimization  problems. 

While  the  mathematics  of  convex  optimization  has  been  studied  for  about  a 
century,  several  related  recent  developments  have  stimulated  new  interest  in  the 
topic.  The  first  is  the  recognition  that  interior-point  methods,  developed  in  the 
1980s to solve linear programming problems, can be used to solve convex optimization
problems as well. These new methods allow us to solve certain new classes
of  convex  optimization  problems,  such  as  semidefinite  programs  and  second-order 
cone  programs,  almost  as  easily  as  linear  programs. 

The  second  development  is  the  discovery  that  convex  optimization  problems 
(beyond  least-squares  and  linear  programs)  are  more  prevalent  in  practice  than 
was  previously  thought.  Since  1990  many  applications  have  been  discovered  in 
areas such as automatic control systems, estimation and signal processing, communications
and networks, electronic circuit design, data analysis and modeling,
statistics, and finance. Convex optimization has also found wide application in combinatorial
optimization and global optimization, where it is used to find bounds on
the  optimal  value,  as  well  as  approximate  solutions.  We  believe  that  many  other 
applications  of  convex  optimization  are  still  waiting  to  be  discovered. 

There  are  great  advantages  to  recognizing  or  formulating  a  problem  as  a  convex 
optimization  problem.  The  most  basic  advantage  is  that  the  problem  can  then  be 
solved,  very  reliably  and  efficiently,  using  interior-point  methods  or  other  special 
methods  for  convex  optimization.  These  solution  methods  are  reliable  enough  to  be 
embedded  in  a  computer-aided  design  or  analysis  tool,  or  even  a  real-time  reactive 
or  automatic  control  system.  There  are  also  theoretical  or  conceptual  advantages 
of  formulating  a  problem  as  a  convex  optimization  problem.  The  associated  dual 
problem,  for  example,  often  has  an  interesting  interpretation  in  terms  of  the  original 
problem,  and  sometimes  leads  to  an  efficient  or  distributed  method  for  solving  it. 

We  think  that  convex  optimization  is  an  important  enough  topic  that  everyone 
who  uses  computational  mathematics  should  know  at  least  a  little  bit  about  it. 
In  our  opinion,  convex  optimization  is  a  natural  next  topic  after  advanced  linear 
algebra (topics like least-squares, singular values), and linear programming.

Goal of this book

For  many  general  purpose  optimization  methods,  the  typical  approach  is  to  just 
try  out  the  method  on  the  problem  to  be  solved.  The  full  benefits  of  convex 
optimization,  in  contrast,  only  come  when  the  problem  is  known  ahead  of  time  to 
be  convex.  Of  course,  many  optimization  problems  are  not  convex,  and  it  can  be 
difficult  to  recognize  the  ones  that  are,  or  to  reformulate  a  problem  so  that  it  is 
convex. 

Our  main  goal  is  to  help  the  reader  develop  a  working  knowledge  of 
convex  optimization,  i.e.,  to  develop  the  skills  and  background  needed 
to  recognize,  formulate,  and  solve  convex  optimization  problems. 

Developing  a  working  knowledge  of  convex  optimization  can  be  mathematically 
demanding,  especially  for  the  reader  interested  primarily  in  applications.  In  our 
experience  (mostly  with  graduate  students  in  electrical  engineering  and  computer 
science),  the  investment  often  pays  off  well,  and  sometimes  very  well. 

There are several books on linear programming, and general nonlinear programming,
that focus on problem formulation, modeling, and applications. Several
other  books  cover  the  theory  of  convex  optimization,  or  interior-point  methods  and 
their  complexity  analysis.  This  book  is  meant  to  be  something  in  between,  a  book 
on  general  convex  optimization  that  focuses  on  problem  formulation  and  modeling. 

We  should  also  mention  what  this  book  is  not.  It  is  not  a  text  primarily  about 
convex  analysis,  or  the  mathematics  of  convex  optimization;  several  existing  texts 
cover these topics well. Nor is the book a survey of algorithms for convex optimization.
Instead we have chosen just a few good algorithms, and describe only simple,
stylized versions of them (which, however, do work well in practice). We make no
attempt to cover the most recent state of the art in interior-point (or other) methods
for solving convex problems. Our coverage of numerical implementation issues
is  also  highly  simplified,  but  we  feel  that  it  is  adequate  for  the  potential  user  to 
develop  working  implementations,  and  we  do  cover,  in  some  detail,  techniques  for 
exploiting  structure  to  improve  the  efficiency  of  the  methods.  We  also  do  not  cover, 
in  more  than  a  simplified  way,  the  complexity  theory  of  the  algorithms  we  describe. 
We  do,  however,  give  an  introduction  to  the  important  ideas  of  self-concordance 
and  complexity  analysis  for  interior-point  methods. 

Audience 

This book is meant for the researcher, scientist, or engineer who uses mathematical
optimization, or more generally, computational mathematics. This includes,
naturally, those working directly in optimization and operations research, and also
many others who use optimization, in fields like computer science, economics, finance,
statistics, data mining, and many fields of science and engineering. Our
primary  focus  is  on  the  latter  group,  the  potential  users  of  convex  optimization, 
and  not  the  (less  numerous)  experts  in  the  field  of  convex  optimization. 

The  only  background  required  of  the  reader  is  a  good  knowledge  of  advanced 
calculus and linear algebra. If the reader has seen basic mathematical analysis (e.g.,
norms,  convergence,  elementary  topology),  and  basic  probability  theory,  he  or  she 
should be able to follow every argument and discussion in the book. We hope that
readers who have not seen analysis and probability, however, can still get all of the
essential  ideas  and  important  points.  Prior  exposure  to  numerical  computing  or 
optimization  is  not  needed,  since  we  develop  all  of  the  needed  material  from  these 
areas  in  the  text  or  appendices. 

Using  this  book  in  courses 

We  hope  that  this  book  will  be  useful  as  the  primary  or  alternate  textbook  for 
several  types  of  courses.  Since  1995  we  have  been  using  drafts  of  this  book  for 
graduate  courses  on  linear,  nonlinear,  and  convex  optimization  (with  engineering 
applications)  at  Stanford  and  UCLA.  We  are  able  to  cover  most  of  the  material, 
though  not  in  detail,  in  a  one  quarter  graduate  course.  A  one  semester  course  allows 
for  a  more  leisurely  pace,  more  applications,  more  detailed  treatment  of  theory, 
and  perhaps  a  short  student  project.  A  two  quarter  sequence  allows  an  expanded 
treatment  of  the  more  basic  topics  such  as  linear  and  quadratic  programming  (which 
are  very  useful  for  the  applications  oriented  student),  or  a  more  substantial  student 
project. 

This  book  can  also  be  used  as  a  reference  or  alternate  text  for  a  more  traditional 
course  on  linear  and  nonlinear  optimization,  or  a  course  on  control  systems  (or 
other  applications  area),  that  includes  some  coverage  of  convex  optimization.  As 
the  secondary  text  in  a  more  theoretically  oriented  course  on  convex  optimization, 
it can be used as a source of simple practical examples.


Chapter 1

Introduction 


In  this  introduction  we  give  an  overview  of  mathematical  optimization,  focusing  on 
the  special  role  of  convex  optimization.  The  concepts  introduced  informally  here 
will  be  covered  in  later  chapters,  with  more  care  and  technical  detail. 


1.1  Mathematical  optimization 

A mathematical optimization problem, or just optimization problem, has the form

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \le b_i, \quad i = 1, \ldots, m.
\end{array} \qquad (1.1)
$$

Here the vector $x = (x_1, \ldots, x_n)$ is the optimization variable of the problem, the
function $f_0 : \mathbf{R}^n \to \mathbf{R}$ is the objective function, the functions $f_i : \mathbf{R}^n \to \mathbf{R}$,
$i = 1, \ldots, m$, are the (inequality) constraint functions, and the constants $b_1, \ldots, b_m$
are the limits, or bounds, for the constraints. A vector $x^\star$ is called optimal, or a
solution of the problem (1.1), if it has the smallest objective value among all vectors
that satisfy the constraints: for any $z$ with $f_1(z) \le b_1, \ldots, f_m(z) \le b_m$, we have

$$ f_0(z) \ge f_0(x^\star). $$
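
As a minimal sketch of this definition, the following Python fragment encodes a small instance of (1.1) and finds the best point among a finite set of candidates. The objective, the single constraint, the limit, and the grid are all made up for illustration, and brute-force search over a grid is only a way to make the definition concrete, not a solution method we recommend.

```python
# Brute-force illustration of problem (1.1) on a finite candidate set.
# The objective, constraint, limit, and grid are invented for this example.

def f0(x):
    # objective function: squared distance from the point (2, 1)
    return (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2

def f1(x):
    # single inequality constraint function (m = 1)
    return x[0] + x[1]

constraints = [f1]
b = [2.0]  # limits b_1, ..., b_m

def is_feasible(x, tol=1e-12):
    # x is feasible if f_i(x) <= b_i for every constraint
    return all(fi(x) <= bi + tol for fi, bi in zip(constraints, b))

# candidate points on a coarse grid over [0, 3] x [0, 3]
grid = [i * 0.25 for i in range(13)]
candidates = [(u, v) for u in grid for v in grid]
feasible = [x for x in candidates if is_feasible(x)]

# x_star has the smallest objective value among the feasible candidates
x_star = min(feasible, key=f0)
print(x_star, f0(x_star))
```

Here `x_star` is optimal only relative to the grid; the optimality condition $f_0(z) \ge f_0(x^\star)$ is checked over the feasible grid points rather than over all of $\mathbf{R}^2$.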

We  generally  consider  families  or  classes  of  optimization  problems,  characterized 
by  particular  forms  of  the  objective  and  constraint  functions.  As  an  important 
example,  the  optimization  problem  (1.1)  is  called  a  linear  program  if  the  objective 
and constraint functions $f_0, \ldots, f_m$ are linear, i.e., satisfy

$$ f_i(\alpha x + \beta y) = \alpha f_i(x) + \beta f_i(y) \qquad (1.2) $$

for all $x, y \in \mathbf{R}^n$ and all $\alpha, \beta \in \mathbf{R}$. If the optimization problem is not linear, it is
called  a  nonlinear  program. 

This book is about a class of optimization problems called convex optimization
problems. A convex optimization problem is one in which the objective and
constraint functions are convex, which means they satisfy the inequality

$$ f_i(\alpha x + \beta y) \le \alpha f_i(x) + \beta f_i(y) \qquad (1.3) $$

for all $x, y \in \mathbf{R}^n$ and all $\alpha, \beta \in \mathbf{R}$ with $\alpha + \beta = 1$, $\alpha \ge 0$, $\beta \ge 0$. Comparing (1.3)
and (1.2), we see that convexity is more general than linearity: inequality replaces
the more restrictive equality, and the inequality must hold only for certain values
of $\alpha$ and $\beta$. Since any linear program is therefore a convex optimization problem,
we  can  consider  convex  optimization  to  be  a  generalization  of  linear  programming. 
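
The difference between the two conditions can be probed numerically. The sketch below, for functions of a single variable ($n = 1$), tests the equality (1.2) with arbitrary real weights and the inequality (1.3) with nonnegative weights summing to one, on randomly sampled points. The helper names, sample ranges, and test functions are our own choices for illustration. A sampled test of this kind can refute linearity or convexity, but passing it proves nothing.

```python
import random

rng = random.Random(0)
# sampled pairs of points x, y in R (the case n = 1)
points = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(200)]

def is_linear_on_samples(f, tol=1e-9):
    # equality (1.2): alpha and beta range over all real numbers
    for x, y in points:
        alpha, beta = rng.uniform(-3, 3), rng.uniform(-3, 3)
        lhs = f(alpha * x + beta * y)
        rhs = alpha * f(x) + beta * f(y)
        if abs(lhs - rhs) > tol:
            return False
    return True

def is_convex_on_samples(f, tol=1e-9):
    # inequality (1.3): only alpha, beta >= 0 with alpha + beta = 1
    for x, y in points:
        alpha = rng.uniform(0, 1)
        beta = 1.0 - alpha
        if f(alpha * x + beta * y) > alpha * f(x) + beta * f(y) + tol:
            return False
    return True

linear = lambda x: 3.0 * x      # satisfies (1.2), hence also (1.3)
square = lambda x: x * x        # satisfies (1.3) but not (1.2)
neg_square = lambda x: -x * x   # violates (1.3)

for name, f in [("linear", linear), ("square", square), ("neg_square", neg_square)]:
    print(name, is_linear_on_samples(f), is_convex_on_samples(f))
```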


1.1.1  Applications 

The  optimization  problem  (1.1)  is  an  abstraction  of  the  problem  of  making  the  best 
possible choice of a vector in $\mathbf{R}^n$ from a set of candidate choices. The variable $x$
represents the choice made; the constraints $f_i(x) \le b_i$ represent firm requirements
or specifications that limit the possible choices, and the objective value $f_0(x)$ represents
the cost of choosing $x$. (We can also think of $-f_0(x)$ as representing the
value, or utility, of choosing $x$.) A solution of the optimization problem (1.1) corresponds
to a choice that has minimum cost (or maximum utility), among all choices
that  meet  the  firm  requirements. 

In portfolio optimization, for example, we seek the best way to invest some
capital in a set of $n$ assets. The variable $x_i$ represents the investment in the $i$th
asset, so the vector $x \in \mathbf{R}^n$ describes the overall portfolio allocation across the set of
assets.  The  constraints  might  represent  a  limit  on  the  budget  (i.e.,  a  limit  on  the 
total  amount  to  be  invested),  the  requirement  that  investments  are  nonnegative 
(assuming  short  positions  are  not  allowed),  and  a  minimum  acceptable  value  of 
expected  return  for  the  whole  portfolio.  The  objective  or  cost  function  might  be 
a  measure  of  the  overall  risk  or  variance  of  the  portfolio  return.  In  this  case, 
the  optimization  problem  (1.1)  corresponds  to  choosing  a  portfolio  allocation  that 
minimizes  risk,  among  all  possible  allocations  that  meet  the  firm  requirements. 
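
To make the correspondence with (1.1) concrete, here is a small sketch with $n = 3$ assets. All of the numbers (expected returns, covariances, budget, and minimum return) are invented, and the coarse grid search stands in for a real solver purely to show which part of the description plays the role of the objective and which parts are constraints.

```python
# Hypothetical data for n = 3 assets; every number below is invented.
mean_return = [0.05, 0.08, 0.12]        # expected return per unit invested
cov = [[0.01, 0.00, 0.00],
       [0.00, 0.04, 0.01],
       [0.00, 0.01, 0.09]]              # covariance of the asset returns
budget = 1.0                            # limit on the total amount invested
min_return = 0.08                       # minimum acceptable expected return

def risk(x):
    # objective f0: variance of the portfolio return, x^T cov x
    return sum(x[i] * cov[i][j] * x[j] for i in range(3) for j in range(3))

def expected_return(x):
    return sum(m * xi for m, xi in zip(mean_return, x))

def meets_requirements(x, tol=1e-9):
    return (sum(x) <= budget + tol                        # budget limit
            and all(xi >= -tol for xi in x)               # no short positions
            and expected_return(x) >= min_return - tol)   # minimum return

# coarse grid of candidate allocations in steps of 0.05
steps = [i * 0.05 for i in range(21)]
candidates = [(a, b, c) for a in steps for b in steps for c in steps]
feasible = [x for x in candidates if meets_requirements(x)]

# allocation with the smallest risk among those meeting the requirements
best = min(feasible, key=risk)
print(best, risk(best), expected_return(best))
```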

Another example is device sizing in electronic design, which is the task of choosing
the width and length of each device in an electronic circuit. Here the variables
represent the widths and lengths of the devices. The constraints represent a variety
of engineering requirements, such as limits on the device sizes imposed by
the  manufacturing  process,  timing  requirements  that  ensure  that  the  circuit  can 
operate  reliably  at  a  specified  speed,  and  a  limit  on  the  total  area  of  the  circuit.  A 
common  objective  in  a  device  sizing  problem  is  the  total  power  consumed  by  the 
circuit.  The  optimization  problem  (1.1)  is  to  find  the  device  sizes  that  satisfy  the 
design  requirements  (on  manufacturability,  timing,  and  area)  and  are  most  power 
efficient. 

In data fitting, the task is to find a model, from a family of potential models,
that  best  fits  some  observed  data  and  prior  information.  Here  the  variables  are  the 
parameters  in  the  model,  and  the  constraints  can  represent  prior  information  or 
required  limits  on  the  parameters  (such  as  nonnegativity).  The  objective  function 
might  be  a  measure  of  misfit  or  prediction  error  between  the  observed  data  and 
the  values  predicted  by  the  model,  or  a  statistical  measure  of  the  unlikeliness  or 
implausibility  of  the  parameter  values.  The  optimization  problem  (1.1)  is  to  find 
the  model  parameter  values  that  are  consistent  with  the  prior  information,  and  give 
the smallest misfit or prediction error with the observed data (or, in a statistical
framework, are most likely).

An  amazing  variety  of  practical  problems  involving  decision  making  (or  system 
design, analysis, and operation) can be cast in the form of a mathematical optimization
problem, or some variation such as a multicriterion optimization problem.
Indeed, mathematical optimization has become an important tool in many areas.
It is widely used in engineering, in electronic design automation, automatic control
systems, and optimal design problems arising in civil, chemical, mechanical,
and  aerospace  engineering.  Optimization  is  used  for  problems  arising  in  network 
design  and  operation,  finance,  supply  chain  management,  scheduling,  and  many 
other  areas.  The  list  of  applications  is  still  steadily  expanding. 

For  most  of  these  applications,  mathematical  optimization  is  used  as  an  aid  to 
a  human  decision  maker,  system  designer,  or  system  operator,  who  supervises  the 
process,  checks  the  results,  and  modifies  the  problem  (or  the  solution  approach) 
when  necessary.  This  human  decision  maker  also  carries  out  any  actions  suggested 
by  the  optimization  problem,  e.g.,  buying  or  selling  assets  to  achieve  the  optimal 
portfolio. 

A  relatively  recent  phenomenon  opens  the  possibility  of  many  other  applications 
for  mathematical  optimization.  With  the  proliferation  of  computers  embedded  in 
products, we have seen a rapid growth in embedded optimization. In these embedded
applications, optimization is used to automatically make real-time choices,
and even carry out the associated actions, with no (or little) human intervention or
oversight. In some application areas, this blending of traditional automatic control
systems and embedded optimization is well under way; in others, it is just starting.
Embedded real-time optimization raises some new challenges: in particular,
it  requires  solution  methods  that  are  extremely  reliable,  and  solve  problems  in  a 
predictable  amount  of  time  (and  memory). 


1.1.2  Solving  optimization  problems 

A solution method for a class of optimization problems is an algorithm that computes
a solution of the problem (to some given accuracy), given a particular problem
from the class, i.e., an instance of the problem. Since the late 1940s, a large effort
has gone into developing algorithms for solving various classes of optimization problems,
analyzing their properties, and developing good software implementations.
The effectiveness of these algorithms, i.e., our ability to solve the optimization problem
(1.1), varies considerably, and depends on factors such as the particular forms
of the objective and constraint functions, how many variables and constraints there
are, and special structure, such as sparsity. (A problem is sparse if each constraint
function depends on only a small number of the variables.)

Even  when  the  objective  and  constraint  functions  are  smooth  (for  example, 
polynomials)  the  general  optimization  problem  (1.1)  is  surprisingly  difficult  to  solve. 
Approaches  to  the  general  problem  therefore  involve  some  kind  of  compromise,  such 
as  very  long  computation  time,  or  the  possibility  of  not  finding  the  solution.  Some 
of  these  methods  are  discussed  in  §1.4. 

There  are,  however,  some  important  exceptions  to  the  general  rule  that  most 
optimization problems are difficult to solve. For a few problem classes we have
effective algorithms that can reliably solve even large problems, with hundreds or
thousands of variables and constraints. Two important and well known examples,
described in §1.2 below (and in detail in chapter 4), are least-squares problems and
linear programs. It is less well known that convex optimization is another exception
to the rule: as with least-squares or linear programming, there are very effective
algorithms that can reliably and efficiently solve even large convex problems.


1.2  Least-squares  and  linear  programming 

In  this  section  we  describe  two  very  widely  known  and  used  special  subclasses  of 
convex  optimization:  least-squares  and  linear  programming.  (A  complete  technical 
treatment  of  these  problems  will  be  given  in  chapter  4.) 


1.2.1  Least-squares  problems 

A least-squares problem is an optimization problem with no constraints (i.e., $m = 0$)
and an objective which is a sum of squares of terms of the form $a_i^T x - b_i$:

$$ \mbox{minimize} \quad f_0(x) = \|Ax - b\|_2^2 = \sum_{i=1}^k (a_i^T x - b_i)^2. \qquad (1.4) $$

Here $A \in \mathbf{R}^{k \times n}$ (with $k \ge n$), $a_i^T$ are the rows of $A$, and the vector $x \in \mathbf{R}^n$ is the
optimization variable.

Solving  least-squares  problems 

The  solution  of  a  least-squares  problem  (1.4)  can  be  reduced  to  solving  a  set  of 
linear  equations, 

$$ (A^T A) x = A^T b, $$

so we have the analytical solution $x = (A^T A)^{-1} A^T b$. For least-squares problems
we have good algorithms (and software implementations) for solving the problem to
high accuracy, with very high reliability. The least-squares problem can be solved
in a time approximately proportional to $n^2 k$, with a known constant. A current
desktop  computer  can  solve  a  least-squares  problem  with  hundreds  of  variables,  and 
thousands  of  terms,  in  a  few  seconds;  more  powerful  computers,  of  course,  can  solve 
larger  problems,  or  the  same  size  problems,  faster.  (Moreover,  these  solution  times 
will  decrease  exponentially  in  the  future,  according  to  Moore’s  law.)  Algorithms 
and  software  for  solving  least-squares  problems  are  reliable  enough  for  embedded 
optimization. 
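
The reduction to linear equations can be sketched in a few lines of plain Python. The code below forms $A^T A$ and $A^T b$ explicitly and solves the resulting system by Gaussian elimination; the helper names and the small line-fitting data set are made up for illustration. This is only a sketch for tiny dense problems: forming $A^T A$ explicitly is the simplest approach, but practical software often uses other factorizations (for example QR) for better numerical accuracy.

```python
def solve_linear(M, v):
    """Solve the square system M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]  # augmented matrix [M | v]
    for col in range(n):
        # bring the row with the largest entry in this column to the pivot position
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def least_squares(A, b):
    """Minimize ||Ax - b||_2^2 by forming and solving (A^T A) x = A^T b."""
    k, n = len(A), len(A[0])
    AtA = [[sum(A[r][i] * A[r][j] for r in range(k)) for j in range(n)]
           for i in range(n)]
    Atb = [sum(A[r][i] * b[r] for r in range(k)) for i in range(n)]
    return solve_linear(AtA, Atb)

# Fit a line b ~ x0 + x1 * t to k = 4 points; each row of A is (1, t).
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 3.0, 5.0, 7.0]
x = least_squares(A, b)
print(x)
```

The data here lie exactly on the line with intercept 1 and slope 2, so the computed `x` should agree with those values up to rounding.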

In  many  cases  we  can  solve  even  larger  least-squares  problems,  by  exploiting 
some  special  structure  in  the  coefficient  matrix  A.  Suppose,  for  example,  that  the 
matrix  A  is  sparse,  which  means  that  it  has  far  fewer  than  kn  nonzero  entries.  By 
exploiting  sparsity,  we  can  usually  solve  the  least-squares  problem  much  faster  than 
order $n^2 k$. A current desktop computer can solve a sparse least-squares problem
with tens of thousands of variables, and hundreds of thousands of terms, in around
a  minute  (although  this  depends  on  the  particular  sparsity  pattern). 

For  extremely  large  problems  (say,  with  millions  of  variables),  or  for  problems 
with  exacting  real-time  computing  requirements,  solving  a  least-squares  problem 
can  be  a  challenge.  But  in  the  vast  majority  of  cases,  we  can  say  that  existing 
methods  are  very  effective,  and  extremely  reliable.  Indeed,  we  can  say  that  solving 
least-squares problems (that are not on the boundary of what is currently achievable)
is a (mature) technology, that can be reliably used by many people who do
not  know,  and  do  not  need  to  know,  the  details. 

Using  least-squares 

The  least-squares  problem  is  the  basis  for  regression  analysis,  optimal  control,  and 
many  parameter  estimation  and  data  fitting  methods.  It  has  a  number  of  statistical 
interpretations,  e.g.,  as  maximum  likelihood  estimation  of  a  vector  x,  given  linear 
measurements  corrupted  by  Gaussian  measurement  errors. 

Recognizing an optimization problem as a least-squares problem is straightforward;
we only need to verify that the objective is a quadratic function (and then
test  whether  the  associated  quadratic  form  is  positive  semidefinite).  While  the 
basic  least-squares  problem  has  a  simple  fixed  form,  several  standard  techniques 
are  used  to  increase  its  flexibility  in  applications. 

In  weighted  least-squares,  the  weighted  least-squares  cost 

$$ \sum_{i=1}^k w_i (a_i^T x - b_i)^2, $$

where $w_1, \ldots, w_k$ are positive, is minimized. (This problem is readily cast and
solved as a standard least-squares problem.) Here the weights $w_i$ are chosen to
reflect differing levels of concern about the sizes of the terms $a_i^T x - b_i$, or simply
to  influence  the  solution.  In  a  statistical  setting,  weighted  least-squares  arises 
in  estimation  of  a  vector  x,  given  linear  measurements  corrupted  by  errors  with 
unequal  variances. 
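
One way to see that the weighted problem is a standard least-squares problem is to absorb the square roots of the weights into the data. The symbols $\tilde A$ and $\tilde b$ below are introduced only for this illustration.

```latex
\sum_{i=1}^k w_i (a_i^T x - b_i)^2
  = \sum_{i=1}^k \left( \sqrt{w_i}\, a_i^T x - \sqrt{w_i}\, b_i \right)^2
  = \| \tilde A x - \tilde b \|_2^2,
\qquad
\tilde A = \mathrm{diag}(\sqrt{w_1}, \ldots, \sqrt{w_k})\, A, \quad
\tilde b = \mathrm{diag}(\sqrt{w_1}, \ldots, \sqrt{w_k})\, b.
```

In other words, scaling the $i$th row of $A$ and the $i$th entry of $b$ by $\sqrt{w_i}$ gives a problem of the form (1.4) with the same minimizer.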

Another  technique  in  least-squares  is  regularization,  in  which  extra  terms  are 
added  to  the  cost  function.  In  the  simplest  case,  a  positive  multiple  of  the  sum  of 
squares  of  the  variables  is  added  to  the  cost  function: 

$$ \sum_{i=1}^k (a_i^T x - b_i)^2 + \rho \sum_{i=1}^n x_i^2, $$

where $\rho > 0$. (This problem too can be formulated as a standard least-squares
problem.) The extra terms penalize large values of $x$, and result in a sensible
solution in cases when minimizing the first sum only does not. The parameter $\rho$ is
chosen by the user to give the right trade-off between making the original objective
function $\sum_{i=1}^k (a_i^T x - b_i)^2$ small, while keeping $\sum_{i=1}^n x_i^2$ not too big. Regularization
comes up in statistical estimation when the vector $x$ to be estimated is given a prior
distribution.
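
The regularized problem can be written in the form (1.4) by stacking the data, as the following short derivation sketches. Here $I$ denotes the $n \times n$ identity matrix and $0$ the zero vector in $\mathbf{R}^n$; the stacked notation is ours.

```latex
\sum_{i=1}^k (a_i^T x - b_i)^2 + \rho \sum_{i=1}^n x_i^2
  = \left\| \begin{bmatrix} A \\ \sqrt{\rho}\, I \end{bmatrix} x
          - \begin{bmatrix} b \\ 0 \end{bmatrix} \right\|_2^2,
\qquad \mbox{with normal equations} \qquad
(A^T A + \rho I)\, x = A^T b.
```

Since $\rho > 0$, the matrix $A^T A + \rho I$ is positive definite, so these equations have a unique solution even when $A^T A$ is singular.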

Weighted least-squares and regularization are covered in chapter 6; their statistical
interpretations are given in chapter 7.


1.2.2 Linear programming

Another important class of optimization problems is linear programming, in which
the objective and all constraint functions are linear:

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & a_i^T x \le b_i, \quad i = 1, \ldots, m.
\end{array} \qquad (1.5)
$$

Here the vectors $c, a_1, \ldots, a_m \in \mathbf{R}^n$ and scalars $b_1, \ldots, b_m \in \mathbf{R}$ are problem parameters
that specify the objective and constraint functions.

Solving  linear  programs 

There  is  no  simple  analytical  formula  for  the  solution  of  a  linear  program  (as  there 
is  for  a  least-squares  problem),  but  there  are  a  variety  of  very  effective  methods  for 
solving  them,  including  Dantzig’s  simplex  method,  and  the  more  recent  interior- 
point  methods  described  later  in  this  book.  While  we  cannot  give  the  exact  number 
of  arithmetic  operations  required  to  solve  a  linear  program  (as  we  can  for  least- 
squares),  we  can  establish  rigorous  bounds  on  the  number  of  operations  required 
to  solve  a  linear  program,  to  a  given  accuracy,  using  an  interior-point  method.  The 
complexity in practice is order $n^2 m$ (assuming $m \ge n$) but with a constant that is
less  well  characterized  than  for  least-squares.  These  algorithms  are  quite  reliable, 
although  perhaps  not  quite  as  reliable  as  methods  for  least-squares.  We  can  easily 
solve  problems  with  hundreds  of  variables  and  thousands  of  constraints  on  a  small 
desktop  computer,  in  a  matter  of  seconds.  If  the  problem  is  sparse,  or  has  some 
other  exploitable  structure,  we  can  often  solve  problems  with  tens  or  hundreds  of 
thousands  of  variables  and  constraints. 

As  with  least-squares  problems,  it  is  still  a  challenge  to  solve  extremely  large 
linear programs, or to solve linear programs with exacting real-time computing requirements.
But, like least-squares, we can say that solving (most) linear programs
is  a  mature  technology.  Linear  programming  solvers  can  be  (and  are)  embedded  in 
many  tools  and  applications. 

Using  linear  programming 

Some  applications  lead  directly  to  linear  programs  in  the  form  (1.5),  or  one  of 
several other standard forms. In many other cases the original optimization problem
does not have a standard linear program form, but can be transformed to an
equivalent  linear  program  (and  then,  of  course,  solved)  using  techniques  covered  in 
detail  in  chapter  4. 

As  a  simple  example,  consider  the  Chebyshev  approximation  problem: 

$$ \mbox{minimize} \quad \max_{i=1,\ldots,k} |a_i^T x - b_i|. \qquad (1.6) $$

Here $x \in \mathbf{R}^n$ is the variable, and $a_1, \ldots, a_k \in \mathbf{R}^n$, $b_1, \ldots, b_k \in \mathbf{R}$ are parameters
that specify the problem instance. Note the resemblance to the least-squares problem
(1.4). For both problems, the objective is a measure of the size of the terms
$a_i^T x - b_i$. In least-squares, we use the sum of squares of the terms as objective,
whereas in Chebyshev approximation, we use the maximum of the absolute values.
One other important distinction is that the objective function in the Chebyshev
approximation  problem  (1.6)  is  not  differentiable;  the  objective  in  the  least-squares 
problem  (1.4)  is  quadratic,  and  therefore  differentiable. 

The  Chebyshev  approximation  problem  (1.6)  can  be  solved  by  solving  the  linear 
program 

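$$
\begin{array}{ll}
\mbox{minimize} & t \\
\mbox{subject to} & a_i^T x - t \le b_i, \quad i = 1, \ldots, k \\
& -a_i^T x - t \le -b_i, \quad i = 1, \ldots, k,
\end{array} \qquad (1.7)
$$

with variables $x \in \mathbf{R}^n$ and $t \in \mathbf{R}$. (The details will be given in chapter 6.)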
Since  linear  programs  are  readily  solved,  the  Chebyshev  approximation  problem  is 
therefore  readily  solved. 
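
The reduction rests on the fact that $|u| \le t$ holds exactly when both $u \le t$ and $-u \le t$ hold, so a bound on the largest absolute residual is a pair of linear inequalities per term. The sketch below checks this numerically on a made-up instance with $n = 1$ and $k = 3$: for each fixed $x$, the smallest $t$ allowed by the linear inequalities equals the Chebyshev objective. The function names and data are hypothetical, and the grid search over $x$ is only a stand-in for an actual linear programming solver.

```python
def residuals(A, b, x):
    """Return the terms a_i^T x - b_i, one for each row a_i of A."""
    return [sum(aij * xj for aij, xj in zip(ai, x)) - bi for ai, bi in zip(A, b)]

def chebyshev_objective(A, b, x):
    """Objective of (1.6): the largest absolute value among the terms."""
    return max(abs(r) for r in residuals(A, b, x))

def smallest_feasible_t(A, b, x):
    """For fixed x, the smallest t satisfying both families of linear inequalities."""
    bounds = []
    for r in residuals(A, b, x):
        bounds.append(r)    # a_i^T x - t <= b_i    means  t >= r
        bounds.append(-r)   # -a_i^T x - t <= -b_i  means  t >= -r
    return max(bounds)

# Made-up instance: approximate the numbers in b by a single scalar x.
A = [[1.0], [1.0], [1.0]]
b = [0.0, 1.0, 4.0]

grid = [[i * 0.5] for i in range(9)]  # x = 0.0, 0.5, ..., 4.0
matches = all(smallest_feasible_t(A, b, x) == chebyshev_objective(A, b, x)
              for x in grid)

x_best = min(grid, key=lambda x: chebyshev_objective(A, b, x))
print(matches, x_best, chebyshev_objective(A, b, x_best))
```

On this instance the best grid point is the midpoint of the smallest and largest entries of `b`, as one would expect for a worst-case fit by a constant.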

Anyone  with  a  working  knowledge  of  linear  programming  would  recognize  the 
Chebyshev  approximation  problem  (1.6)  as  one  that  can  be  reduced  to  a  linear 
program.  For  those  without  this  background,  though,  it  might  not  be  obvious  that 
the  Chebyshev  approximation  problem  (1.6),  with  its  nondifferentiable  objective, 
can  be  formulated  and  solved  as  a  linear  program. 

While  recognizing  problems  that  can  be  reduced  to  linear  programs  is  more 
involved than recognizing a least-squares problem, it is a skill that is readily acquired,
since only a few standard tricks are used. The task can even be partially
automated; some software systems for specifying and solving optimization problems
can automatically recognize (some) problems that can be reformulated as
linear  programs. 


1.3  Convex  optimization 

A  convex  optimization  problem  is  one  of  the  form 

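$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \le b_i, \quad i = 1, \ldots, m,
\end{array} \qquad (1.8)
$$

where the functions $f_0, \ldots, f_m : \mathbf{R}^n \to \mathbf{R}$ are convex, i.e., satisfy

$$ f_i(\alpha x + \beta y) \le \alpha f_i(x) + \beta f_i(y) $$

for all $x, y \in \mathbf{R}^n$ and all $\alpha, \beta \in \mathbf{R}$ with $\alpha + \beta = 1$, $\alpha \ge 0$, $\beta \ge 0$. The least-squares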
problem  (1.4)  and  linear  programming  problem  (1.5)  are  both  special  cases  of  the 
general  convex  optimization  problem  (1.8). 


1.3.1  Solving  convex  optimization  problems 

There  is  in  general  no  analytical  formula  for  the  solution  of  convex  optimization 
problems, but (as with linear programming problems) there are very effective methods
for solving them. Interior-point methods work very well in practice, and in some
cases can be proved to solve the problem to a specified accuracy with a number of
operations that does not exceed a polynomial of the problem dimensions. (This is
covered in chapter 11.)

We will see that interior-point methods can solve the problem (1.8) in a number
of steps or iterations that is almost always in the range between 10 and 100.
Ignoring  any  structure  in  the  problem  (such  as  sparsity),  each  step  requires  on  the 
order  of 

$$ \max\{n^3, n^2 m, F\} $$

operations, where $F$ is the cost of evaluating the first and second derivatives of the
objective and constraint functions $f_0, \ldots, f_m$.
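
To get a feel for these numbers, the following sketch multiplies the per-step estimate by the typical range of 10 to 100 steps. The problem size and the value assumed for $F$ are invented, constant factors are ignored, and the result is an order-of-magnitude operation count, not a running time.

```python
def interior_point_cost(n, m, F, steps):
    """Rough operation count: steps * max{n^3, n^2 m, F}, ignoring constants and structure."""
    per_step = max(n ** 3, n ** 2 * m, F)
    return steps * per_step

n, m = 100, 1000   # hundreds of variables, thousands of constraints
F = 10 ** 6        # assumed cost of evaluating the derivatives (hypothetical)

low = interior_point_cost(n, m, F, steps=10)
high = interior_point_cost(n, m, F, steps=100)
print(low, high)
```

For this choice of sizes the term $n^2 m$ dominates the per-step cost.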

Like  methods  for  solving  linear  programs,  these  interior-point  methods  are  quite 
reliable.  We  can  easily  solve  problems  with  hundreds  of  variables  and  thousands 
of  constraints  on  a  current  desktop  computer,  in  at  most  a  few  tens  of  seconds.  By 
exploiting  problem  structure  (such  as  sparsity),  we  can  solve  far  larger  problems, 
with  many  thousands  of  variables  and  constraints. 

We  cannot  yet  claim  that  solving  general  convex  optimization  problems  is  a 
mature technology, like solving least-squares or linear programming problems.
Interior-point methods for general nonlinear convex optimization are still
a very active research area, and no consensus has emerged yet as to what the best
method or methods are. But it is reasonable to expect that solving general convex
optimization problems will become a technology within a few years. And for
some  subclasses  of  convex  optimization  problems,  for  example  second-order  cone 
programming  or  geometric  programming  (studied  in  detail  in  chapter  4),  it  is  fair 
to  say  that  interior-point  methods  are  approaching  a  technology. 


1.3.2  Using  convex  optimization 

Using convex optimization is, at least conceptually, very much like using
least-squares or linear programming. If we can formulate a problem as a convex
optimization problem, then we can solve it efficiently, just as we can solve a least-squares
problem efficiently. With only a bit of exaggeration, we can say that, if you formulate
a practical problem as a convex optimization problem, then you have solved
the  original  problem. 

There  are  also  some  important  differences.  Recognizing  a  least-squares  problem 
is  straightforward,  but  recognizing  a  convex  function  can  be  difficult.  In  addition, 
there  are  many  more  tricks  for  transforming  convex  problems  than  for  transforming 
linear  programs.  Recognizing  convex  optimization  problems,  or  those  that  can 
be  transformed  to  convex  optimization  problems,  can  therefore  be  challenging. 
The  main  goal  of  this  book  is  to  give  the  reader  the  background  needed  to  do 
this.  Once  the  skill  of  recognizing  or  formulating  convex  optimization  problems  is 
developed,  you  will  find  that  surprisingly  many  problems  can  be  solved  via  convex 
optimization. 

The challenge, and art, in using convex optimization is in recognizing and
formulating the problem. Once this formulation is done, solving the problem is, like
least-squares or linear programming, (almost) technology.


1.4 Nonlinear optimization

Nonlinear  optimization  (or  nonlinear  programming)  is  the  term  used  to  describe 
an  optimization  problem  when  the  objective  or  constraint  functions  are  not  linear, 
but  not  known  to  be  convex.  Sadly,  there  are  no  effective  methods  for  solving 
the  general  nonlinear  programming  problem  (1.1).  Even  simple  looking  problems 
with  as  few  as  ten  variables  can  be  extremely  challenging,  while  problems  with  a 
few  hundreds  of  variables  can  be  intractable.  Methods  for  the  general  nonlinear 
programming  problem  therefore  take  several  different  approaches,  each  of  which 
involves  some  compromise. 


1.4.1  Local  optimization 

In  local  optimization,  the  compromise  is  to  give  up  seeking  the  optimal  x,  which 
minimizes  the  objective  over  all  feasible  points.  Instead  we  seek  a  point  that  is 
only  locally  optimal,  which  means  that  it  minimizes  the  objective  function  among 
feasible  points  that  are  near  it,  but  is  not  guaranteed  to  have  a  lower  objective 
value  than  all  other  feasible  points.  A  large  fraction  of  the  research  on  general 
nonlinear  programming  has  focused  on  methods  for  local  optimization,  which  as  a 
consequence  are  well  developed. 

Local  optimization  methods  can  be  fast,  can  handle  large-scale  problems,  and 
are  widely  applicable,  since  they  only  require  differentiability  of  the  objective  and 
constraint  functions.  As  a  result,  local  optimization  methods  are  widely  used  in 
applications  where  there  is  value  in  finding  a  good  point,  if  not  the  very  best.  In 
an  engineering  design  application,  for  example,  local  optimization  can  be  used  to 
improve  the  performance  of  a  design  originally  obtained  by  manual,  or  other,  design 
methods. 

There are several disadvantages of local optimization methods, beyond (possibly)
not finding the true, globally optimal solution. The methods require an initial
guess  for  the  optimization  variable.  This  initial  guess  or  starting  point  is  critical, 
and  can  greatly  affect  the  objective  value  of  the  local  solution  obtained.  Little 
information  is  provided  about  how  far  from  (globally)  optimal  the  local  solution 
is.  Local  optimization  methods  are  often  sensitive  to  algorithm  parameter  values, 
which  may  need  to  be  adjusted  for  a  particular  problem,  or  family  of  problems. 
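
The dependence on the starting point can be seen in a small sketch. The objective below is a hypothetical one-dimensional nonconvex function chosen for illustration, and fixed-step gradient descent stands in for a generic local method; the step size and iteration count are assumptions that happen to work for this function.

```python
def f(x):
    # Hypothetical nonconvex objective with two local minima.
    return x**4 - 3.0 * x**2 + x


def grad_f(x):
    # Derivative of f.
    return 4.0 * x**3 - 6.0 * x + 1.0


def gradient_descent(x0, step=0.01, iterations=2000):
    """Plain fixed-step gradient descent; returns the final iterate."""
    x = x0
    for _ in range(iterations):
        x -= step * grad_f(x)
    return x


x_right = gradient_descent(2.0)    # settles in the local minimum near x = 1.13
x_left = gradient_descent(-2.0)    # settles in the lower minimum near x = -1.30
print(x_right, f(x_right))
print(x_left, f(x_left))
```

Both runs end at points where the gradient vanishes, but only the second reaches the lower objective value, and nothing in the first run signals that a better point exists.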

Using a local optimization method is trickier than solving a least-squares problem,
linear program, or convex optimization problem. It involves experimenting
with  the  choice  of  algorithm,  adjusting  algorithm  parameters,  and  finding  a  good 
enough  initial  guess  (when  one  instance  is  to  be  solved)  or  a  method  for  producing 
a  good  enough  initial  guess  (when  a  family  of  problems  is  to  be  solved).  Roughly 
speaking, local optimization methods are more art than technology. Local optimization
is a well developed art, and often very effective, but it is nevertheless an
art.  In  contrast,  there  is  little  art  involved  in  solving  a  least-squares  problem  or 
a  linear  program  (except,  of  course,  those  on  the  boundary  of  what  is  currently 
possible). 

An  interesting  comparison  can  be  made  between  local  optimization  methods  for 
nonlinear programming, and convex optimization. Since differentiability of the
objective and constraint functions is the only requirement for most local optimization
methods,  formulating  a  practical  problem  as  a  nonlinear  optimization  problem  is 
relatively  straightforward.  The  art  in  local  optimization  is  in  solving  the  problem 
(in  the  weakened  sense  of  finding  a  locally  optimal  point),  once  it  is  formulated. 
In  convex  optimization  these  are  reversed:  The  art  and  challenge  is  in  problem 
formulation;  once  a  problem  is  formulated  as  a  convex  optimization  problem,  it  is 
relatively  straightforward  to  solve  it. 


1.4.2  Global  optimization 

In  global  optimization,  the  true  global  solution  of  the  optimization  problem  (1.1) 
is found; the compromise is efficiency. The worst-case complexity of global
optimization methods grows exponentially with the problem sizes $n$ and $m$; the hope
is that in practice, for the particular problem instances encountered, the method is
far faster. While this favorable situation does occur, it is not typical. Even small
problems, with a few tens of variables, can take a very long time (e.g., hours or
days) to solve.

Global  optimization  is  used  for  problems  with  a  small  number  of  variables,  where 
computing  time  is  not  critical,  and  the  value  of  finding  the  true  global  solution  is 
very high. One example from engineering design is worst-case analysis or verification
of a high value or safety-critical system. Here the variables represent uncertain
parameters, that can vary during manufacturing, or with the environment or
operating condition. The objective function is a utility function, i.e., one for which
smaller  values  are  worse  than  larger  values,  and  the  constraints  represent  prior 
knowledge  about  the  possible  parameter  values.  The  optimization  problem  (1.1)  is 
the  problem  of  finding  the  worst-case  values  of  the  parameters.  If  the  worst-case 
value  is  acceptable,  we  can  certify  the  system  as  safe  or  reliable  (with  respect  to 
the  parameter  variations). 

A  local  optimization  method  can  rapidly  find  a  set  of  parameter  values  that 
is bad, but not guaranteed to be the absolute worst possible. If a local optimization
method finds parameter values that yield unacceptable performance, it has
succeeded in determining that the system is not reliable. But a local optimization
method cannot certify the system as reliable; it can only fail to find bad parameter
values. A global optimization method, in contrast, will find the absolute worst values
of the parameters, and if the associated performance is acceptable, can certify
the  system  as  safe.  The  cost  is  computation  time,  which  can  be  very  large,  even 
for  a  relatively  small  number  of  parameters.  But  it  may  be  worth  it  in  cases  where 
the  value  of  certifying  the  performance  is  high,  or  the  cost  of  being  wrong  about 
the  reliability  or  safety  is  high. 


1.4.3  Role  of  convex  optimization  in  nonconvex  problems 

In  this  book  we  focus  primarily  on  convex  optimization  problems,  and  applications 
that  can  be  reduced  to  convex  optimization  problems.  But  convex  optimization 
also  plays  an  important  role  in  problems  that  are  not  convex. 



Initialization  for  local  optimization 

One  obvious  use  is  to  combine  convex  optimization  with  a  local  optimization 
method.  Starting  with  a  nonconvex  problem,  we  first  find  an  approximate,  but 
convex,  formulation  of  the  problem.  By  solving  this  approximate  problem,  which 
can  be  done  easily  and  without  an  initial  guess,  we  obtain  the  exact  solution  to  the 
approximate  convex  problem.  This  point  is  then  used  as  the  starting  point  for  a 
local  optimization  method,  applied  to  the  original  nonconvex  problem. 

Convex  heuristics  for  nonconvex  optimization 

Convex optimization is the basis for several heuristics for solving nonconvex problems.
One interesting example we will see is the problem of finding a sparse vector
$x$ (i.e., one with few nonzero entries) that satisfies some constraints. While this is
a difficult combinatorial problem, there are some simple heuristics, based on convex
optimization, that often find fairly sparse solutions. (These are described in
chapter  6.) 

Another broad example is given by randomized algorithms, in which an approximate
solution to a nonconvex problem is found by drawing some number of
candidates  from  a  probability  distribution,  and  taking  the  best  one  found  as  the 
approximate  solution.  Now  suppose  the  family  of  distributions  from  which  we  will 
draw  the  candidates  is  parametrized,  e.g.,  by  its  mean  and  covariance.  We  can  then 
pose  the  question,  which  of  these  distributions  gives  us  the  smallest  expected  value 
of  the  objective?  It  turns  out  that  this  problem  is  sometimes  a  convex  problem, 
and  therefore  efficiently  solved.  (See,  e.g.,  exercise  11.23.) 

Bounds  for  global  optimization 

Many  methods  for  global  optimization  require  a  cheaply  computable  lower  bound 
on  the  optimal  value  of  the  nonconvex  problem.  Two  standard  methods  for  doing 
this  are  based  on  convex  optimization.  In  relaxation,  each  nonconvex  constraint 
is  replaced  with  a  looser,  but  convex,  constraint.  In  Lagrangian  relaxation,  the 
Lagrangian  dual  problem  (described  in  chapter  5)  is  solved.  This  problem  is  convex, 
and  provides  a  lower  bound  on  the  optimal  value  of  the  nonconvex  problem. 
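
The relaxation idea can be sketched on a toy problem. The example below is hypothetical and not taken from the text: it minimizes a sum of squares over vectors with entries in {-1, 1} (a nonconvex constraint), and relaxes each entry to the interval [-1, 1]. The relaxed problem is convex and separable, so it is solved here by clipping; the nonconvex problem is solved by brute force only because the instance is tiny.

```python
from itertools import product


def objective(x, c):
    """Squared Euclidean distance between x and the target vector c."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def nonconvex_optimum(c):
    """Brute-force minimum over all x with entries in {-1, 1}."""
    return min(objective(x, c) for x in product((-1.0, 1.0), repeat=len(c)))


def relaxation_bound(c):
    """Minimum over the relaxed (convex) set -1 <= x_i <= 1.
    The problem is separable, so clipping each c_i solves it."""
    x = [min(1.0, max(-1.0, ci)) for ci in c]
    return objective(x, c)


c = [0.3, 1.7, -0.2]              # hypothetical data
lower = relaxation_bound(c)       # cheap lower bound from the convex relaxation
best = nonconvex_optimum(c)       # true optimal value of the nonconvex problem
print(lower, best)
```

Because the relaxed feasible set contains the original one, the relaxed optimal value can never exceed the true optimal value, which is what makes it usable as a bound.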


1.5  Outline 

The book is divided into three main parts, titled Theory, Applications, and
Algorithms.


1.5.1  Part  I:  Theory 

In  part  I,  Theory,  we  cover  basic  definitions,  concepts,  and  results  from  convex 
analysis  and  convex  optimization.  We  make  no  attempt  to  be  encyclopedic,  and 
skew our selection of topics toward those that we think are useful in recognizing
and formulating convex optimization problems. This is classical material, almost
all  of  which  can  be  found  in  other  texts  on  convex  analysis  and  optimization.  We 
make  no  attempt  to  give  the  most  general  form  of  the  results;  for  that  the  reader 
can  refer  to  any  of  the  standard  texts  on  convex  analysis. 

Chapters  2  and  3  cover  convex  sets  and  convex  functions,  respectively.  We 
give  some  common  examples  of  convex  sets  and  functions,  as  well  as  a  number  of 
convex  calculus  rules,  i.e.,  operations  on  sets  and  functions  that  preserve  convexity. 
Combining  the  basic  examples  with  the  convex  calculus  rules  allows  us  to  form 
(or  perhaps  more  importantly,  recognize)  some  fairly  complicated  convex  sets  and 
functions. 

In chapter 4, Convex optimization problems, we give a careful treatment of
optimization problems, and describe a number of transformations that can be used to
reformulate problems. We also introduce some common subclasses of convex
optimization, such as linear programming and geometric programming, and the more
recently developed second-order cone programming and semidefinite programming.

Chapter 5 covers Lagrangian duality, which plays a central role in convex
optimization. Here we give the classical Karush-Kuhn-Tucker conditions for optimality,
and  a  local  and  global  sensitivity  analysis  for  convex  optimization  problems. 


1.5.2  Part  II:  Applications 

In part II, Applications, we describe a variety of applications of convex optimization,
in  areas  like  probability  and  statistics,  computational  geometry,  and  data  fitting. 
We  have  described  these  applications  in  a  way  that  is  accessible,  we  hope,  to  a  broad 
audience.  To  keep  each  application  short,  we  consider  only  simple  cases,  sometimes 
adding  comments  about  possible  extensions.  We  are  sure  that  our  treatment  of 
some  of  the  applications  will  cause  experts  to  cringe,  and  we  apologize  to  them 
in  advance.  But  our  goal  is  to  convey  the  flavor  of  the  application,  quickly  and 
to  a  broad  audience,  and  not  to  give  an  elegant,  theoretically  sound,  or  complete 
treatment.  Our  own  backgrounds  are  in  electrical  engineering,  in  areas  like  control 
systems,  signal  processing,  and  circuit  analysis  and  design.  Although  we  include 
these  topics  in  the  courses  we  teach  (using  this  book  as  the  main  text),  only  a  few 
of  these  applications  are  broadly  enough  accessible  to  be  included  here. 

The  aim  of  part  II  is  to  show  the  reader,  by  example,  how  convex  optimization 
can  be  applied  in  practice. 


1.5.3  Part  III:  Algorithms 

In part III, Algorithms, we describe numerical methods for solving convex
optimization problems, focusing on Newton’s algorithm and interior-point methods.
Part III is organized as three chapters, which cover unconstrained optimization,
equality constrained optimization, and inequality constrained optimization,
respectively. These chapters follow a natural hierarchy, in which solving a problem is
reduced to solving a sequence of simpler problems. Quadratic optimization problems
(including, e.g., least-squares) form the base of the hierarchy; they can be
solved exactly by solving a set of linear equations. Newton’s method, developed in
chapters  9  and  10,  is  the  next  level  in  the  hierarchy.  In  Newton’s  method,  solving 
an  unconstrained  or  equality  constrained  problem  is  reduced  to  solving  a  sequence 
of  quadratic  problems.  In  chapter  11,  we  describe  interior-point  methods,  which 
form  the  top  level  of  the  hierarchy.  These  methods  solve  an  inequality  constrained 
problem  by  solving  a  sequence  of  unconstrained,  or  equality  constrained,  problems. 
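
The middle level of this hierarchy can be sketched in one dimension. The code below is a minimal illustration, not the variant developed in chapters 9 and 10: it omits the line search, and the test function f(x) = exp(x) - 2x, the starting point, and the tolerance are assumptions chosen for the example. Each iteration minimizes the local quadratic model of f, which amounts to solving a (here 1 x 1) linear equation.

```python
import math


def newton_minimize(grad, hess, x0, tol=1e-12, max_iter=50):
    """Pure Newton iteration for a smooth, strictly convex scalar function."""
    x = x0
    for _ in range(max_iter):
        g, h = grad(x), hess(x)
        step = -g / h          # minimizer of the quadratic model: h * step = -g
        x += step
        if abs(step) < tol:    # stop once the Newton step is negligible
            break
    return x


# Hypothetical objective f(x) = exp(x) - 2x, whose minimizer is log(2).
x_star = newton_minimize(lambda x: math.exp(x) - 2.0,
                         lambda x: math.exp(x),
                         x0=0.0)
print(x_star, math.log(2.0))
```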

Overall  we  cover  just  a  handful  of  algorithms,  and  omit  entire  classes  of  good 
methods, such as quasi-Newton, conjugate-gradient, bundle, and cutting-plane
algorithms. For the methods we do describe, we give simplified variants, and not the
latest,  most  sophisticated  versions.  Our  choice  of  algorithms  was  guided  by  several 
criteria.  We  chose  algorithms  that  are  simple  (to  describe  and  implement),  but 
also  reliable  and  robust,  and  effective  and  fast  enough  for  most  problems. 

Many  users  of  convex  optimization  end  up  using  (but  not  developing)  standard 
software,  such  as  a  linear  or  semidefinite  programming  solver.  For  these  users,  the 
material  in  part  III  is  meant  to  convey  the  basic  flavor  of  the  methods,  and  give 
some  ideas  of  their  basic  attributes.  For  those  few  who  will  end  up  developing  new 
algorithms,  we  think  that  part  III  serves  as  a  good  introduction. 


1.5.4  Appendices 

There  are  three  appendices.  The  first  lists  some  basic  facts  from  mathematics  that 
we  use,  and  serves  the  secondary  purpose  of  setting  out  our  notation.  The  second 
appendix  covers  a  fairly  particular  topic,  optimization  problems  with  quadratic 
objective and one quadratic constraint. These are nonconvex problems that nevertheless
can be effectively solved, and we use the results in several of the applications
described  in  part  II. 

The final appendix gives a brief introduction to numerical linear algebra,
concentrating on methods that can exploit problem structure, such as sparsity, to gain
efficiency. We do not cover a number of important topics, including roundoff analysis,
or give any details of the methods used to carry out the required factorizations.
These  topics  are  covered  by  a  number  of  excellent  texts. 


1.5.5  Comments  on  examples 

In many places in the text (but particularly in parts II and III, which cover
applications and algorithms, respectively) we illustrate ideas using specific examples.
In some cases, the examples are chosen (or designed) specifically to illustrate our
point; in other cases, the examples are chosen to be ‘typical’. This means that the
examples were chosen as samples from some obvious or simple probability distribution.
The dangers of drawing conclusions about algorithm performance from a
few tens or hundreds of randomly generated examples are well known, so we will
not repeat them here. These examples are meant only to give a rough idea of
algorithm performance, or a rough idea of how the computational effort varies with
problem dimensions, and not as accurate predictors of algorithm performance. In
particular, your results may vary from ours.


1.5.6 Comments on exercises

Each chapter concludes with a set of exercises. Some involve working out the
details of an argument or claim made in the text. Others focus on determining, or
establishing, convexity of some given sets, functions, or problems; or more generally,
convex optimization problem formulation. Some chapters include numerical
exercises,  which  require  some  (but  not  much)  programming  in  an  appropriate  high 
level  language.  The  difficulty  level  of  the  exercises  is  mixed,  and  varies  without 
warning  from  quite  straightforward  to  rather  tricky. 


1.6  Notation 


Our  notation  is  more  or  less  standard,  with  a  few  exceptions.  In  this  section  we 
describe  our  basic  notation;  a  more  complete  list  appears  on  page  697. 

We use $\mathbf{R}$ to denote the set of real numbers, $\mathbf{R}_+$ to denote the set of nonnegative
real numbers, and $\mathbf{R}_{++}$ to denote the set of positive real numbers. The set of real
$n$-vectors is denoted $\mathbf{R}^n$, and the set of real $m \times n$ matrices is denoted $\mathbf{R}^{m \times n}$. We
delimit vectors and matrices with square brackets, with the components separated
by space. We use parentheses to construct column vectors from comma separated
lists. For example, if $a, b, c \in \mathbf{R}$, we have

$$(a, b, c) = \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} a & b & c \end{bmatrix}^T,$$

which is an element of $\mathbf{R}^3$. The symbol $\mathbf{1}$ denotes a vector all of whose components
are one (with dimension determined from context). The notation $x_i$ can refer to
the $i$th component of the vector $x$, or to the $i$th element of a set or sequence of
vectors $x_1, x_2, \ldots$. The context, or the text, makes it clear which is meant.

We use $\mathbf{S}^k$ to denote the set of symmetric $k \times k$ matrices, $\mathbf{S}^k_+$ to denote the
set of symmetric positive semidefinite $k \times k$ matrices, and $\mathbf{S}^k_{++}$ to denote the set
of symmetric positive definite $k \times k$ matrices. The curled inequality symbol $\succeq$
(and its strict form $\succ$) is used to denote generalized inequality: between vectors,
it represents componentwise inequality; between symmetric matrices, it represents
matrix inequality. With a subscript, the symbol $\preceq_K$ (or $\prec_K$) denotes generalized
inequality with respect to the cone $K$ (explained in §2.4.1).

Our notation for describing functions deviates a bit from standard notation,
but we hope it will cause no confusion. We use the notation $f : \mathbf{R}^p \to \mathbf{R}^q$ to mean
that $f$ is an $\mathbf{R}^q$-valued function on some subset of $\mathbf{R}^p$, specifically, its domain,
which we denote $\operatorname{dom} f$. We can think of our use of the notation $f : \mathbf{R}^p \to \mathbf{R}^q$ as
a declaration of the function type, as in a computer language: $f : \mathbf{R}^p \to \mathbf{R}^q$ means
that the function $f$ takes as argument a real $p$-vector, and returns a real $q$-vector.
The set $\operatorname{dom} f$, the domain of the function $f$, specifies the subset of $\mathbf{R}^p$ of points
$x$ for which $f(x)$ is defined. As an example, we describe the logarithm function
as $\log : \mathbf{R} \to \mathbf{R}$, with $\operatorname{dom} \log = \mathbf{R}_{++}$. The notation $\log : \mathbf{R} \to \mathbf{R}$ means that
the logarithm function accepts and returns a real number; $\operatorname{dom} \log = \mathbf{R}_{++}$ means
that the logarithm is defined only for positive numbers.
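
The analogy with a typed function declaration can be made concrete. In the sketch below, the type annotations play the role of the declaration log : R -> R, and an explicit check plays the role of dom log. The names `log` and `in_dom_log`, and the choice to raise `ValueError` outside the domain, are assumptions made for this illustration.

```python
import math


def in_dom_log(x: float) -> bool:
    """Membership test for dom log, the set of positive reals."""
    return x > 0.0


def log(x: float) -> float:
    """Declared type: takes a real number, returns a real number.
    Defined only on dom log; other arguments are rejected."""
    if not in_dom_log(x):
        raise ValueError("argument is outside dom log")
    return math.log(x)


print(log(1.0))            # inside the domain
print(in_dom_log(-2.0))    # the type admits -2.0, but the domain does not
```

The type signature says which arguments are meaningful to pass; the domain says for which of them a value is actually defined.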

We use $\mathbf{R}^n$ as a generic finite-dimensional vector space. We will encounter
several other finite-dimensional vector spaces, e.g., the space of polynomials of a
variable with a given maximum degree, or the space $\mathbf{S}^k$ of symmetric $k \times k$ matrices.
By identifying a basis for a vector space, we can always identify it with $\mathbf{R}^n$ (where
$n$ is its dimension), and therefore the generic results, stated for the vector space
$\mathbf{R}^n$, can be applied. We usually leave it to the reader to translate general results
or statements to other vector spaces. For example, any linear function $f : \mathbf{R}^n \to \mathbf{R}$
can be represented in the form $f(x) = c^T x$, where $c \in \mathbf{R}^n$. The corresponding
statement for the vector space $\mathbf{S}^k$ can be found by choosing a basis and translating.
This results in the statement: any linear function $f : \mathbf{S}^k \to \mathbf{R}$ can be represented
in the form $f(X) = \operatorname{tr}(CX)$, where $C \in \mathbf{S}^k$.

Gauss in the 1820s, and recently translated by Stewart [Gau95]. More recent work
includes the books by Lawson and Hanson [LH95] and Bjorck [Bjo96]. References on linear
programming can be found in chapter 4.

There  are  many  good  texts  on  local  methods  for  nonlinear  programming,  including  Gill, 
Murray,  and  Wright  [GMW81],  Nocedal  and  Wright  [NW99],  Luenberger  [Lue84],  and 
Bertsekas  [Ber99]. 

Global  optimization  is  covered  in  the  books  by  Horst  and  Pardalos  [HP94],  Pinter  [Pin95], 
and  Tuy  [Tuy98].  Using  convex  optimization  to  find  bounds  for  nonconvex  problems  is 
an  active  research  topic,  and  addressed  in  the  books  above  on  global  optimization,  the 
book by Ben-Tal and Nemirovski [BTN01, §4.3], and the survey by Nesterov, Wolkowicz,
and Ye [NWY00]. Some notable papers on this subject are Goemans and Williamson
[GW95], Nesterov [Nes00, Nes98], Ye [Ye99], and Parrilo [Par03]. Randomized methods
are  discussed  in  Motwani  and  Raghavan  [MR95]. 

Convex  analysis,  the  mathematics  of  convex  sets,  functions,  and  optimization  problems,  is 
a well developed subfield of mathematics. Basic references include the books by
Rockafellar [Roc70], Hiriart-Urruty and Lemarechal [HUL93, HUL01], Borwein and Lewis [BL00],
and  Bertsekas,  Nedic,  and  Ozdaglar  [Ber03].  More  references  on  convex  analysis  can  be 
found  in  chapters  2-5. 

Nesterov  and  Nemirovski  [NN94]  were  the  first  to  point  out  that  interior-point  methods 
can  solve  many  convex  optimization  problems;  see  also  the  references  in  chapter  11.  The 
book by Ben-Tal and Nemirovski [BTN01] covers modern convex optimization,
interior-point methods, and applications.

Solution  methods  for  convex  optimization  that  we  do  not  cover  in  this  book  include 
subgradient methods [Sho85], bundle methods [HUL93], cutting-plane methods [Kel60,
EM75, GLY96], and the ellipsoid method [Sho91, BGT81].

The idea that convex optimization problems are tractable is not new. It has long been
recognized that the theory of convex optimization is far more straightforward (and complete)
than  the  theory  of  general  nonlinear  optimization.  In  this  context  Rockafellar  stated,  in 
his  1993  SIAM  Review  survey  paper  [Roc93], 

In fact the great watershed in optimization isn’t between linearity and nonlin¬

$\epsilon > 0$ is the required accuracy. (We will see some simple results like these in chapter 11.)
The first comprehensive work on this topic is the book by Nesterov and Nemirovski
[NN94]. Other books include Ben-Tal and Nemirovski [BTN01, lecture 5] and Renegar
[Ren01]. The polynomial-time complexity of interior-point methods for various convex
optimization  problems  is  in  marked  contrast  to  the  situation  for  a  number  of  nonconvex 
optimization  problems,  for  which  all  known  algorithms  require,  in  the  worst  case,  a  number 
of operations that is exponential in the problem dimensions.

Convex optimization has been used in many applications areas, too numerous to cite
here.  Convex  analysis  is  central  in  economics  and  finance,  where  it  is  the  basis  of  many 
results.  For  example  the  separating  hyperplane  theorem,  together  with  a  no-arbitrage 
assumption,  is  used  to  deduce  the  existence  of  prices  and  risk-neutral  probabilities  (see, 
e.g.,  Luenberger  [Lue95,  Lue98]  and  Ross  [Ros99]).  Convex  optimization,  especially  our 
ability to solve semidefinite programs, has recently received particular attention in
automatic control theory. Applications of convex optimization in control theory can be
found in the books by Boyd and Barratt [BB91], Boyd, El Ghaoui, Feron, and
Balakrishnan [BEFB94], Dahleh and Diaz-Bobillo [DDB95], El Ghaoui and Niculescu [EN00], and
Dullerud and Paganini [DP00]. A good example of embedded (convex) optimization is
model  predictive  control,  an  automatic  control  technique  that  requires  the  solution  of  a 
(convex)  quadratic  program  at  each  step.  Model  predictive  control  is  now  widely  used  in 
the chemical process control industry; see Morari and Zafirou [MZ89]. Another applications
area where convex optimization (and especially, geometric programming) has a long
history  is  electronic  circuit  design.  Research  papers  on  this  topic  include  Fishburn  and 
Dunlop  [FD85],  Sapatnekar,  Rao,  Vaidya,  and  Kang  [SRVK93],  and  Hershenson,  Boyd, 
and  Lee  [HBL01].  Luo  [Luo03]  gives  a  survey  of  applications  in  signal  processing  and 
communications.  More  references  on  applications  of  convex  optimization  can  be  found  in 
chapters  4  and  6-8. 

High  quality  implementations  of  recent  interior-point  methods  for  convex  optimization 
problems  are  available  in  the  LOQO  [Van97]  and  MOSEK  [MOS02]  software  packages, 
and the codes listed in chapter 11. Software systems for specifying optimization problems
include AMPL [FGK99] and GAMS [BKMR98]. Both provide some support for
recognizing  problems  that  can  be  transformed  to  linear  programs. 


Part  I 

Theory 


Chapter  2 

Convex  sets 


2.1  Affine  and  convex  sets 

2.1.1  Lines  and  line  segments 

Suppose $x_1 \neq x_2$ are two points in $\mathbf{R}^n$. Points of the form

$$y = \theta x_1 + (1 - \theta) x_2,$$

where $\theta \in \mathbf{R}$, form the line passing through $x_1$ and $x_2$. The parameter value $\theta = 0$
corresponds to $y = x_2$, and the parameter value $\theta = 1$ corresponds to $y = x_1$.
Values of the parameter $\theta$ between 0 and 1 correspond to the (closed) line segment
between $x_1$ and $x_2$.

Expressing $y$ in the form

$$y = x_2 + \theta (x_1 - x_2)$$

gives another interpretation: $y$ is the sum of the base point $x_2$ (corresponding
to $\theta = 0$) and the direction $x_1 - x_2$ (which points from $x_2$ to $x_1$) scaled by the
parameter $\theta$. Thus, $\theta$ gives the fraction of the way from $x_2$ to $x_1$ where $y$ lies. As
$\theta$ increases from 0 to 1, the point $y$ moves from $x_2$ to $x_1$; for $\theta > 1$, the point $y$ lies
on the line beyond $x_1$. This is illustrated in figure 2.1.
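
The two parametrizations can be checked numerically. The sketch below represents points as plain Python lists; the function names `point_on_line` and `point_from_base` and the sample points are assumptions made for illustration.

```python
def point_on_line(x1, x2, theta):
    """Return y = theta * x1 + (1 - theta) * x2."""
    return [theta * a + (1.0 - theta) * b for a, b in zip(x1, x2)]


def point_from_base(x1, x2, theta):
    """Equivalent form: base point x2 plus theta times the direction x1 - x2."""
    return [b + theta * (a - b) for a, b in zip(x1, x2)]


x1 = [1.0, 2.0]    # hypothetical points in the plane
x2 = [3.0, -2.0]

for theta in (0.0, 0.5, 1.0, 2.0):
    # theta = 0 gives x2, theta = 1 gives x1, theta = 2 lies beyond x1.
    print(theta, point_on_line(x1, x2, theta), point_from_base(x1, x2, theta))
```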

2.1.2  Affine  sets 

A set $C \subseteq \mathbf{R}^n$ is affine if the line through any two distinct points in $C$ lies in $C$,
i.e., if for any $x_1, x_2 \in C$ and $\theta \in \mathbf{R}$, we have $\theta x_1 + (1 - \theta) x_2 \in C$. In other words,
$C$ contains the linear combination of any two points in $C$, provided the coefficients
in the linear combination sum to one.

This idea can be generalized to more than two points. We refer to a point
of the form $\theta_1 x_1 + \cdots + \theta_k x_k$, where $\theta_1 + \cdots + \theta_k = 1$, as an affine combination
of the points $x_1, \ldots, x_k$. Using induction from the definition of affine set (i.e.,
that it contains every affine combination of two points in it), it can be shown that
an affine set contains every affine combination of its points: If $C$ is an affine set,
$x_1, \ldots, x_k \in C$, and $\theta_1 + \cdots + \theta_k = 1$, then the point $\theta_1 x_1 + \cdots + \theta_k x_k$ also belongs
to $C$.

Figure 2.1 The line passing through $x_1$ and $x_2$ is described parametrically
by $\theta x_1 + (1 - \theta) x_2$, where $\theta$ varies over $\mathbf{R}$. The line segment between $x_1$ and
$x_2$, which corresponds to $\theta$ between 0 and 1, is shown darker.

If $C$ is an affine set and $x_0 \in C$, then the set

$$V = C - x_0 = \{x - x_0 \mid x \in C\}$$

is a subspace, i.e., closed under sums and scalar multiplication. To see this, suppose $v_1, v_2 \in V$ and $\alpha, \beta \in \mathbf{R}$. Then we have $v_1 + x_0 \in C$ and $v_2 + x_0 \in C$, and so

$$\alpha v_1 + \beta v_2 + x_0 = \alpha(v_1 + x_0) + \beta(v_2 + x_0) + (1 - \alpha - \beta)x_0 \in C,$$

since $C$ is affine, and $\alpha + \beta + (1 - \alpha - \beta) = 1$. We conclude that $\alpha v_1 + \beta v_2 \in V$, since $\alpha v_1 + \beta v_2 + x_0 \in C$.

Thus, the affine set $C$ can be expressed as

$$C = V + x_0 = \{v + x_0 \mid v \in V\},$$

i.e., as a subspace plus an offset. The subspace $V$ associated with the affine set $C$ does not depend on the choice of $x_0$, so $x_0$ can be chosen as any point in $C$. We define the dimension of an affine set $C$ as the dimension of the subspace $V = C - x_0$, where $x_0$ is any element of $C$.


Example 2.1 Solution set of linear equations. The solution set of a system of linear equations, $C = \{x \mid Ax = b\}$, where $A \in \mathbf{R}^{m \times n}$ and $b \in \mathbf{R}^m$, is an affine set. To show this, suppose $x_1, x_2 \in C$, i.e., $Ax_1 = b$, $Ax_2 = b$. Then for any $\theta$, we have

$$\begin{aligned} A(\theta x_1 + (1 - \theta)x_2) &= \theta Ax_1 + (1 - \theta)Ax_2 \\ &= \theta b + (1 - \theta)b \\ &= b, \end{aligned}$$

which shows that the affine combination $\theta x_1 + (1 - \theta)x_2$ is also in $C$. The subspace associated with the affine set $C$ is the nullspace of $A$.

We also have a converse: every affine set can be expressed as the solution set of a system of linear equations.
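The calculation in Example 2.1 can be checked on a small instance. This is only an illustrative sketch: the matrix `A`, the vector `b`, the two solutions, and the helper names `matvec` and `affine_combination` are assumptions chosen for the demonstration.

```python
def matvec(A, x):
    """Matrix-vector product with A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def affine_combination(x1, x2, theta):
    """Return theta * x1 + (1 - theta) * x2."""
    return [theta * a + (1 - theta) * b for a, b in zip(x1, x2)]


A = [[1, 1, 1], [1, -1, 0]]
b = [3, 0]
x1 = [1, 1, 1]  # satisfies A x1 = b
x2 = [0, 0, 3]  # satisfies A x2 = b

# theta ranges over all of R, not only [0, 1]
for theta in (-2, 0, 0.5, 3):
    y = affine_combination(x1, x2, theta)
    print(theta, y, matvec(A, y) == b)
```

Every value of `theta`, including values outside [0, 1], yields another solution, which is what distinguishes an affine set from a merely convex one.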


The set of all affine combinations of points in some set $C \subseteq \mathbf{R}^n$ is called the affine hull of $C$, and denoted $\mathbf{aff}\, C$:

$$\mathbf{aff}\, C = \{\theta_1 x_1 + \cdots + \theta_k x_k \mid x_1, \ldots, x_k \in C,\ \theta_1 + \cdots + \theta_k = 1\}.$$

The affine hull is the smallest affine set that contains $C$, in the following sense: if $S$ is any affine set with $C \subseteq S$, then $\mathbf{aff}\, C \subseteq S$.


2.1.3  Affine  dimension  and  relative  interior 

We define the affine dimension of a set $C$ as the dimension of its affine hull. Affine dimension is useful in the context of convex analysis and optimization, but is not always consistent with other definitions of dimension. As an example consider the unit circle in $\mathbf{R}^2$, i.e., $\{x \in \mathbf{R}^2 \mid x_1^2 + x_2^2 = 1\}$. Its affine hull is all of $\mathbf{R}^2$, so its affine dimension is two. By most definitions of dimension, however, the unit circle in $\mathbf{R}^2$ has dimension one.

If the affine dimension of a set $C \subseteq \mathbf{R}^n$ is less than $n$, then the set lies in the affine set $\mathbf{aff}\, C \neq \mathbf{R}^n$. We define the relative interior of the set $C$, denoted $\mathbf{relint}\, C$, as its interior relative to $\mathbf{aff}\, C$:

$$\mathbf{relint}\, C = \{x \in C \mid B(x, r) \cap \mathbf{aff}\, C \subseteq C \text{ for some } r > 0\},$$

where $B(x, r) = \{y \mid \|y - x\| \le r\}$, the ball of radius $r$ and center $x$ in the norm $\|\cdot\|$. (Here $\|\cdot\|$ is any norm; all norms define the same relative interior.) We can then define the relative boundary of a set $C$ as $\mathbf{cl}\, C \setminus \mathbf{relint}\, C$, where $\mathbf{cl}\, C$ is the closure of $C$.


Example 2.2 Consider a square in the $(x_1, x_2)$-plane in $\mathbf{R}^3$, defined as

$$C = \{x \in \mathbf{R}^3 \mid -1 \le x_1 \le 1,\ -1 \le x_2 \le 1,\ x_3 = 0\}.$$

Its affine hull is the $(x_1, x_2)$-plane, i.e., $\mathbf{aff}\, C = \{x \in \mathbf{R}^3 \mid x_3 = 0\}$. The interior of $C$ is empty, but the relative interior is

$$\mathbf{relint}\, C = \{x \in \mathbf{R}^3 \mid -1 < x_1 < 1,\ -1 < x_2 < 1,\ x_3 = 0\}.$$

Its boundary (in $\mathbf{R}^3$) is itself; its relative boundary is the wire-frame outline,

$$\{x \in \mathbf{R}^3 \mid \max\{|x_1|, |x_2|\} = 1,\ x_3 = 0\}.$$
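The three sets in Example 2.2 translate directly into membership tests. The sketch below uses hypothetical function names and exact comparisons, which is reasonable for the integer and dyadic test points used here but would need tolerances for general floating-point input.

```python
def in_C(x):
    """Membership in the square C of Example 2.2."""
    return -1 <= x[0] <= 1 and -1 <= x[1] <= 1 and x[2] == 0


def in_relint_C(x):
    """Membership in the relative interior of C (strict inequalities)."""
    return -1 < x[0] < 1 and -1 < x[1] < 1 and x[2] == 0


def on_relative_boundary_C(x):
    """Membership in the wire-frame outline of C."""
    return max(abs(x[0]), abs(x[1])) == 1 and x[2] == 0


for point in [(0, 0, 0), (1, 0.5, 0), (1, 1, 0), (0, 0, 0.5)]:
    print(point, in_C(point), in_relint_C(point), on_relative_boundary_C(point))
```

Because $C$ is closed, a point is on the relative boundary exactly when it is in $C$ but not in the relative interior, which the loop makes visible.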


2.1.4  Convex  sets 

A set $C$ is convex if the line segment between any two points in $C$ lies in $C$, i.e., if for any $x_1, x_2 \in C$ and any $\theta$ with $0 \le \theta \le 1$, we have

$$\theta x_1 + (1 - \theta)x_2 \in C.$$
Figure 2.2 Some simple convex and nonconvex sets. Left. The hexagon, which includes its boundary (shown darker), is convex. Middle. The kidney-shaped set is not convex, since the line segment between the two points in the set shown as dots is not contained in the set. Right. The square contains some boundary points but not others, and is not convex.


Figure  2.3  The  convex  hulls  of  two  sets  in  R2.  Left.  The  convex  hull  of  a 
set  of  fifteen  points  (shown  as  dots)  is  the  pentagon  (shown  shaded).  Right. 
The  convex  hull  of  the  kidney  shaped  set  in  figure  2.2  is  the  shaded  set. 


Roughly  speaking,  a  set  is  convex  if  every  point  in  the  set  can  be  seen  by  every  other 
point,  along  an  unobstructed  straight  path  between  them,  where  unobstructed 
means  lying  in  the  set.  Every  affine  set  is  also  convex,  since  it  contains  the  entire 
line  between  any  two  distinct  points  in  it,  and  therefore  also  the  line  segment 
between  the  points.  Figure  2.2  illustrates  some  simple  convex  and  nonconvex  sets 
in  R2. 

We call a point of the form $\theta_1 x_1 + \cdots + \theta_k x_k$, where $\theta_1 + \cdots + \theta_k = 1$ and $\theta_i \ge 0$, $i = 1, \ldots, k$, a convex combination of the points $x_1, \ldots, x_k$. As with affine sets, it can be shown that a set is convex if and only if it contains every convex combination of its points. A convex combination of points can be thought of as a mixture or weighted average of the points, with $\theta_i$ the fraction of $x_i$ in the mixture.

The convex hull of a set $C$, denoted $\mathbf{conv}\, C$, is the set of all convex combinations of points in $C$:

$$\mathbf{conv}\, C = \{\theta_1 x_1 + \cdots + \theta_k x_k \mid x_i \in C,\ \theta_i \ge 0,\ i = 1, \ldots, k,\ \theta_1 + \cdots + \theta_k = 1\}.$$

As the name suggests, the convex hull $\mathbf{conv}\, C$ is always convex. It is the smallest convex set that contains $C$: If $B$ is any convex set that contains $C$, then $\mathbf{conv}\, C \subseteq B$. Figure 2.3 illustrates the definition of convex hull.
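For a finite set of points in $\mathbf{R}^2$, the convex hull is a polygon whose vertices can be computed directly. The sketch below uses Andrew's monotone chain method, a standard technique that is not described in the text and is swapped in here purely for illustration; the function names are assumptions.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counterclockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_vertices(points):
    """Vertices of the convex hull of 2-D points, in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:  # build the lower chain, left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):  # build the upper chain, right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # the last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]


points = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
print(convex_hull_vertices(points))
```

The interior point (1, 1) and the edge point (1, 0) are dropped: both are convex combinations of the four corners, so they add nothing to the hull.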

The idea of a convex combination can be generalized to include infinite sums, integrals, and, in the most general form, probability distributions. Suppose $\theta_1, \theta_2, \ldots$ satisfy

$$\theta_i \ge 0, \quad i = 1, 2, \ldots, \qquad \sum_{i=1}^{\infty} \theta_i = 1,$$

and $x_1, x_2, \ldots \in C$, where $C \subseteq \mathbf{R}^n$ is convex. Then

$$\sum_{i=1}^{\infty} \theta_i x_i \in C,$$

if the series converges. More generally, suppose $p : \mathbf{R}^n \to \mathbf{R}$ satisfies $p(x) \ge 0$ for all $x \in C$ and $\int_C p(x)\, dx = 1$, where $C \subseteq \mathbf{R}^n$ is convex. Then

$$\int_C p(x)\, x\, dx \in C,$$

if the integral exists.

In the most general form, suppose $C \subseteq \mathbf{R}^n$ is convex and $x$ is a random vector with $x \in C$ with probability one. Then $\mathbf{E}\, x \in C$. Indeed, this form includes all the others as special cases. For example, suppose the random variable $x$ only takes on the two values $x_1$ and $x_2$, with $\mathbf{prob}(x = x_1) = \theta$ and $\mathbf{prob}(x = x_2) = 1 - \theta$, where $0 \le \theta \le 1$. Then $\mathbf{E}\, x = \theta x_1 + (1 - \theta)x_2$, and we are back to a simple convex combination of two points.


2.1.5  Cones 

A set $C$ is called a cone, or nonnegative homogeneous, if for every $x \in C$ and $\theta \ge 0$ we have $\theta x \in C$. A set $C$ is a convex cone if it is convex and a cone, which means that for any $x_1, x_2 \in C$ and $\theta_1, \theta_2 \ge 0$, we have

$$\theta_1 x_1 + \theta_2 x_2 \in C.$$

Points of this form can be described geometrically as forming the two-dimensional pie slice with apex 0 and edges passing through $x_1$ and $x_2$. (See figure 2.4.)

A point of the form $\theta_1 x_1 + \cdots + \theta_k x_k$ with $\theta_1, \ldots, \theta_k \ge 0$ is called a conic combination (or a nonnegative linear combination) of $x_1, \ldots, x_k$. If $x_i$ are in a convex cone $C$, then every conic combination of $x_i$ is in $C$. Conversely, a set $C$ is a convex cone if and only if it contains all conic combinations of its elements. Like convex (or affine) combinations, the idea of conic combination can be generalized to infinite sums and integrals.

The conic hull of a set $C$ is the set of all conic combinations of points in $C$, i.e.,

$$\{\theta_1 x_1 + \cdots + \theta_k x_k \mid x_i \in C,\ \theta_i \ge 0,\ i = 1, \ldots, k\},$$

which is also the smallest convex cone that contains $C$ (see figure 2.5).
Figure 2.4 The pie slice shows all points of the form $\theta_1 x_1 + \theta_2 x_2$, where $\theta_1, \theta_2 \ge 0$. The apex of the slice (which corresponds to $\theta_1 = \theta_2 = 0$) is at 0; its edges (which correspond to $\theta_1 = 0$ or $\theta_2 = 0$) pass through the points $x_1$ and $x_2$.


Figure  2.5  The  conic  hulls  (shown  shaded)  of  the  two  sets  of  figure  2.3. 


2.2  Some  important  examples

In  this  section  we  describe  some  important  examples  of  convex  sets  which  we  will 
encounter  throughout  the  rest  of  the  book.  We  start  with  some  simple  examples. 

•  The empty set $\emptyset$, any single point (i.e., singleton) $\{x_0\}$, and the whole space $\mathbf{R}^n$ are affine (hence, convex) subsets of $\mathbf{R}^n$.

•  Any  line  is  affine.  If  it  passes  through  zero,  it  is  a  subspace,  hence  also  a 
convex  cone. 

•  A  line  segment  is  convex,  but  not  affine  (unless  it  reduces  to  a  point). 

•  A ray, which has the form $\{x_0 + \theta v \mid \theta \ge 0\}$, where $v \neq 0$, is convex, but not affine. It is a convex cone if its base $x_0$ is 0.

•  Any  subspace  is  affine,  and  a  convex  cone  (hence  convex). 


2.2.1  Hyperplanes  and  halfspaces 

A hyperplane is a set of the form

$$\{x \mid a^T x = b\},$$

where $a \in \mathbf{R}^n$, $a \neq 0$, and $b \in \mathbf{R}$. Analytically it is the solution set of a nontrivial linear equation among the components of $x$ (and hence an affine set). Geometrically, the hyperplane $\{x \mid a^T x = b\}$ can be interpreted as the set of points with a constant inner product to a given vector $a$, or as a hyperplane with normal vector $a$; the constant $b \in \mathbf{R}$ determines the offset of the hyperplane from the origin. This geometric interpretation can be understood by expressing the hyperplane in the form

$$\{x \mid a^T(x - x_0) = 0\},$$

where $x_0$ is any point in the hyperplane (i.e., any point that satisfies $a^T x_0 = b$). This representation can in turn be expressed as

$$\{x \mid a^T(x - x_0) = 0\} = x_0 + a^\perp,$$

where $a^\perp$ denotes the orthogonal complement of $a$, i.e., the set of all vectors orthogonal to it:

$$a^\perp = \{v \mid a^T v = 0\}.$$

This shows that the hyperplane consists of an offset $x_0$, plus all vectors orthogonal to the (normal) vector $a$. These geometric interpretations are illustrated in figure 2.6.

A hyperplane divides $\mathbf{R}^n$ into two halfspaces. A (closed) halfspace is a set of the form

$$\{x \mid a^T x \le b\}, \qquad (2.1)$$

where $a \neq 0$, i.e., the solution set of one (nontrivial) linear inequality. Halfspaces are convex, but not affine. This is illustrated in figure 2.7.
Figure 2.6 Hyperplane in $\mathbf{R}^2$, with normal vector $a$ and a point $x_0$ in the hyperplane. For any point $x$ in the hyperplane, $x - x_0$ (shown as the darker arrow) is orthogonal to $a$.


Figure 2.7 A hyperplane defined by $a^T x = b$ in $\mathbf{R}^2$ determines two halfspaces. The halfspace determined by $a^T x \ge b$ (not shaded) is the halfspace extending in the direction $a$. The halfspace determined by $a^T x \le b$ (which is shown shaded) extends in the direction $-a$. The vector $a$ is the outward normal of this halfspace.


Figure 2.8 The shaded set is the halfspace determined by $a^T(x - x_0) \le 0$. The vector $x_1 - x_0$ makes an acute angle with $a$, so $x_1$ is not in the halfspace. The vector $x_2 - x_0$ makes an obtuse angle with $a$, and so is in the halfspace.


The halfspace (2.1) can also be expressed as

$$\{x \mid a^T(x - x_0) \le 0\}, \qquad (2.2)$$

where $x_0$ is any point on the associated hyperplane, i.e., satisfies $a^T x_0 = b$. The representation (2.2) suggests a simple geometric interpretation: the halfspace consists of $x_0$ plus any vector that makes an obtuse (or right) angle with the (outward normal) vector $a$. This is illustrated in figure 2.8.

The boundary of the halfspace (2.1) is the hyperplane $\{x \mid a^T x = b\}$. The set $\{x \mid a^T x < b\}$, which is the interior of the halfspace $\{x \mid a^T x \le b\}$, is called an open halfspace.
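The two descriptions of a halfspace, by the offset $b$ and by a point $x_0$ on the bounding hyperplane, can be compared on sample points. This is a small sketch with hypothetical function names and integer data chosen so that the comparisons are exact.

```python
def dot(u, v):
    """Inner product of two vectors given as sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def in_halfspace(a, b, x):
    """True if a^T x <= b."""
    return dot(a, x) <= b


def in_halfspace_offset(a, x0, x):
    """True if a^T (x - x0) <= 0, where x0 lies on the hyperplane a^T x = b."""
    return dot(a, [xi - x0i for xi, x0i in zip(x, x0)]) <= 0


a, b = [1, 2], 4
x0 = [4, 0]  # a point on the hyperplane: a^T x0 = 4 = b

for x in ([0, 0], [2, 1], [5, 0], [0, 3]):
    print(x, in_halfspace(a, b, x), in_halfspace_offset(a, x0, x))
```

Both tests agree on every point, since $a^T(x - x_0) = a^T x - b$ whenever $a^T x_0 = b$.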


2.2.2  Euclidean  balls  and  ellipsoids 

A (Euclidean) ball (or just ball) in $\mathbf{R}^n$ has the form

$$B(x_c, r) = \{x \mid \|x - x_c\|_2 \le r\} = \{x \mid (x - x_c)^T(x - x_c) \le r^2\},$$

where $r > 0$, and $\|\cdot\|_2$ denotes the Euclidean norm, i.e., $\|u\|_2 = (u^T u)^{1/2}$. The vector $x_c$ is the center of the ball and the scalar $r$ is its radius; $B(x_c, r)$ consists of all points within a distance $r$ of the center $x_c$. Another common representation for the Euclidean ball is

$$B(x_c, r) = \{x_c + ru \mid \|u\|_2 \le 1\}.$$
Figure 2.9 An ellipsoid in $\mathbf{R}^2$, shown shaded. The center $x_c$ is shown as a dot, and the two semi-axes are shown as line segments.


A Euclidean ball is a convex set: if $\|x_1 - x_c\|_2 \le r$, $\|x_2 - x_c\|_2 \le r$, and $0 \le \theta \le 1$, then

$$\begin{aligned} \|\theta x_1 + (1 - \theta)x_2 - x_c\|_2 &= \|\theta(x_1 - x_c) + (1 - \theta)(x_2 - x_c)\|_2 \\ &\le \theta\|x_1 - x_c\|_2 + (1 - \theta)\|x_2 - x_c\|_2 \\ &\le r. \end{aligned}$$

(Here we use the homogeneity property and triangle inequality for $\|\cdot\|_2$; see §A.1.2.)
A related family of convex sets is the ellipsoids, which have the form

$$\mathcal{E} = \{x \mid (x - x_c)^T P^{-1}(x - x_c) \le 1\}, \qquad (2.3)$$

where $P = P^T \succ 0$, i.e., $P$ is symmetric and positive definite. The vector $x_c \in \mathbf{R}^n$ is the center of the ellipsoid. The matrix $P$ determines how far the ellipsoid extends in every direction from $x_c$; the lengths of the semi-axes of $\mathcal{E}$ are given by $\sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues of $P$. A ball is an ellipsoid with $P = r^2 I$. Figure 2.9 shows an ellipsoid in $\mathbf{R}^2$.

Another common representation of an ellipsoid is

$$\mathcal{E} = \{x_c + Au \mid \|u\|_2 \le 1\}, \qquad (2.4)$$

where $A$ is square and nonsingular. In this representation we can assume without loss of generality that $A$ is symmetric and positive definite. By taking $A = P^{1/2}$, this representation gives the ellipsoid defined in (2.3). When the matrix $A$ in (2.4) is symmetric positive semidefinite but singular, the set in (2.4) is called a degenerate ellipsoid; its affine dimension is equal to the rank of $A$. Degenerate ellipsoids are also convex.
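For a $2 \times 2$ matrix $P$ the quantities in (2.3) have closed forms, which makes the role of the eigenvalues easy to see. The sketch below is restricted to the $2 \times 2$ symmetric positive definite case, and the function names are illustrative assumptions.

```python
import math


def semi_axis_lengths(P):
    """Square roots of the eigenvalues of a symmetric positive definite 2x2 P."""
    (p11, p12), (_, p22) = P
    mean = (p11 + p22) / 2
    radius = math.hypot((p11 - p22) / 2, p12)
    # the eigenvalues of a symmetric 2x2 matrix are mean -/+ radius
    return math.sqrt(mean - radius), math.sqrt(mean + radius)


def in_ellipsoid(P, xc, x):
    """True if (x - xc)^T P^{-1} (x - xc) <= 1 for a 2x2 matrix P."""
    (p11, p12), (_, p22) = P
    det = p11 * p22 - p12 * p12
    d0, d1 = x[0] - xc[0], x[1] - xc[1]
    # P^{-1} = (1 / det) * [[p22, -p12], [-p12, p11]]
    q = (p22 * d0 * d0 - 2 * p12 * d0 * d1 + p11 * d1 * d1) / det
    return q <= 1


P = [[4, 0], [0, 1]]
print(semi_axis_lengths(P))             # semi-axes of length 1 and 2
print(in_ellipsoid(P, (0, 0), (2, 0)))  # tip of the long semi-axis
print(in_ellipsoid(P, (0, 0), (0, 1.5)))
print(semi_axis_lengths([[9, 0], [0, 9]]))  # a ball of radius 3, P = r^2 I
```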


2.2.3  Norm  balls  and  norm  cones 

Suppose $\|\cdot\|$ is any norm on $\mathbf{R}^n$ (see §A.1.2). From the general properties of norms it can be shown that a norm ball of radius $r$ and center $x_c$, given by $\{x \mid \|x - x_c\| \le r\}$, is convex. The norm cone associated with the norm $\|\cdot\|$ is the set

$$C = \{(x, t) \mid \|x\| \le t\} \subseteq \mathbf{R}^{n+1}.$$

It is (as the name suggests) a convex cone.

Figure 2.10 Boundary of second-order cone in $\mathbf{R}^3$, $\{(x_1, x_2, t) \mid (x_1^2 + x_2^2)^{1/2} \le t\}$.


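Example 2.3 The second-order cone is the norm cone for the Euclidean norm, i.e.,

$$\begin{aligned} C &= \{(x, t) \in \mathbf{R}^{n+1} \mid \|x\|_2 \le t\} \\ &= \left\{ \begin{bmatrix} x \\ t \end{bmatrix} \;\middle|\; \begin{bmatrix} x \\ t \end{bmatrix}^T \begin{bmatrix} I & 0 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} x \\ t \end{bmatrix} \le 0,\ t \ge 0 \right\}. \end{aligned}$$

The second-order cone is also known by several other names. It is called the quadratic cone, since it is defined by a quadratic inequality. It is also called the Lorentz cone or ice-cream cone. Figure 2.10 shows the second-order cone in $\mathbf{R}^3$.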
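The two descriptions of the second-order cone in Example 2.3 can be compared numerically. In this sketch the function names are assumptions, and the sample points are chosen so that the arithmetic is exact.

```python
import math


def in_second_order_cone(x, t):
    """True if ||x||_2 <= t."""
    return math.sqrt(sum(xi * xi for xi in x)) <= t


def in_quadratic_form(x, t):
    """True if [x; t]^T diag(I, -1) [x; t] <= 0 and t >= 0."""
    return sum(xi * xi for xi in x) - t * t <= 0 and t >= 0


for x, t in [((3, 4), 5), ((3, 4), -5), ((1, 1), 1), ((0, 0), 0)]:
    print(x, t, in_second_order_cone(x, t), in_quadratic_form(x, t))

# A conic combination 2 * u + 3 * v of two points of the cone stays in the cone.
u, v = (3, 4, 5), (0, -1, 1)
w = [2 * ui + 3 * vi for ui, vi in zip(u, v)]
print(w, in_second_order_cone(w[:2], w[2]))
```

The point with $t = -5$ shows why the quadratic form alone is not enough: it satisfies the quadratic inequality, and the condition $t \ge 0$ is what excludes it.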


2.2.4  Polyhedra 

A polyhedron is defined as the solution set of a finite number of linear equalities and inequalities:

$$\mathcal{P} = \{x \mid a_j^T x \le b_j,\ j = 1, \ldots, m,\ c_j^T x = d_j,\ j = 1, \ldots, p\}. \qquad (2.5)$$

A polyhedron is thus the intersection of a finite number of halfspaces and hyperplanes. Affine sets (e.g., subspaces, hyperplanes, lines), rays, line segments, and halfspaces are all polyhedra. It is easily shown that polyhedra are convex sets. A bounded polyhedron is sometimes called a polytope, but some authors use the opposite convention (i.e., polytope for any set of the form (2.5), and polyhedron when it is bounded). Figure 2.11 shows an example of a polyhedron defined as the intersection of five halfspaces.

Figure 2.11 The polyhedron $\mathcal{P}$ (shown shaded) is the intersection of five halfspaces, with outward normal vectors $a_1, \ldots, a_5$.

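It will be convenient to use the compact notation

$$\mathcal{P} = \{x \mid Ax \preceq b,\ Cx = d\} \qquad (2.6)$$

for (2.5), where

$$A = \begin{bmatrix} a_1^T \\ \vdots \\ a_m^T \end{bmatrix}, \qquad C = \begin{bmatrix} c_1^T \\ \vdots \\ c_p^T \end{bmatrix},$$

and the symbol $\preceq$ denotes vector inequality or componentwise inequality in $\mathbf{R}^m$: $u \preceq v$ means $u_i \le v_i$ for $i = 1, \ldots, m$.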
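The compact form (2.6) maps directly onto a membership test. The sketch below is a plain-Python illustration; the function name `in_polyhedron`, the tolerance parameter `tol`, and the sample data (the square of Example 2.2 written in the form (2.6)) are assumptions made for the demonstration.

```python
def dot(u, v):
    """Inner product of two vectors given as sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def in_polyhedron(A, b, C, d, x, tol=1e-9):
    """True if A x <= b componentwise and C x = d, up to the tolerance tol."""
    inequalities_hold = all(dot(row, x) <= bi + tol for row, bi in zip(A, b))
    equalities_hold = all(abs(dot(row, x) - di) <= tol for row, di in zip(C, d))
    return inequalities_hold and equalities_hold


# The square {x in R^3 | -1 <= x1 <= 1, -1 <= x2 <= 1, x3 = 0}:
A = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0]]
b = [1, 1, 1, 1]
C = [[0, 0, 1]]
d = [0]

for x in ([0.5, -1, 0], [0.5, -1, 0.2], [1.5, 0, 0]):
    print(x, in_polyhedron(A, b, C, d, x))
```

Each row of `A` contributes one halfspace and each row of `C` one hyperplane, so the test is a literal intersection of the sets in (2.5).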


Example 2.4 The nonnegative orthant is the set of points with nonnegative components, i.e.,

$$\mathbf{R}^n_+ = \{x \in \mathbf{R}^n \mid x_i \ge 0,\ i = 1, \ldots, n\} = \{x \in \mathbf{R}^n \mid x \succeq 0\}.$$

(Here $\mathbf{R}_+$ denotes the set of nonnegative numbers: $\mathbf{R}_+ = \{x \in \mathbf{R} \mid x \ge 0\}$.) The nonnegative orthant is a polyhedron and a cone (and therefore called a polyhedral cone).


Simplexes 

Simplexes are another important family of polyhedra. Suppose the $k + 1$ points $v_0, \ldots, v_k \in \mathbf{R}^n$ are affinely independent, which means $v_1 - v_0, \ldots, v_k - v_0$ are linearly independent. The simplex determined by them is given by

$$C = \mathbf{conv}\{v_0, \ldots, v_k\} = \{\theta_0 v_0 + \cdots + \theta_k v_k \mid \theta \succeq 0,\ \mathbf{1}^T\theta = 1\}, \qquad (2.7)$$

where $\mathbf{1}$ denotes the vector with all entries one. The affine dimension of this simplex is $k$, so it is sometimes referred to as a $k$-dimensional simplex in $\mathbf{R}^n$.


Example 2.5 Some common simplexes. A 1-dimensional simplex is a line segment; a 2-dimensional simplex is a triangle (including its interior); and a 3-dimensional simplex is a tetrahedron.

The unit simplex is the $n$-dimensional simplex determined by the zero vector and the unit vectors, i.e., $0, e_1, \ldots, e_n \in \mathbf{R}^n$. It can be expressed as the set of vectors that satisfy

$$x \succeq 0, \qquad \mathbf{1}^T x \le 1.$$

The probability simplex is the $(n - 1)$-dimensional simplex determined by the unit vectors $e_1, \ldots, e_n \in \mathbf{R}^n$. It is the set of vectors that satisfy

$$x \succeq 0, \qquad \mathbf{1}^T x = 1.$$

Vectors in the probability simplex correspond to probability distributions on a set with $n$ elements, with $x_i$ interpreted as the probability of the $i$th element.
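The two simplexes of Example 2.5 differ only in whether the sum constraint is an inequality or an equality. A minimal sketch, with hypothetical function names and a small tolerance `tol` added as an assumption to absorb floating-point rounding:

```python
def in_unit_simplex(x, tol=1e-12):
    """True if x >= 0 componentwise and 1^T x <= 1."""
    return all(xi >= -tol for xi in x) and sum(x) <= 1 + tol


def in_probability_simplex(x, tol=1e-12):
    """True if x >= 0 componentwise and 1^T x = 1."""
    return all(xi >= -tol for xi in x) and abs(sum(x) - 1) <= tol


for x in ([0.5, 0.25, 0.25], [0.25, 0.25], [0, 0, 0], [0.75, 0.5], [1.5, -0.5]):
    print(x, in_unit_simplex(x), in_probability_simplex(x))
```

Every point of the probability simplex is also in the unit simplex, but not conversely: the origin and `[0.25, 0.25]` are in the unit simplex only.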


To describe the simplex (2.7) as a polyhedron, i.e., in the form (2.6), we proceed as follows. By definition, $x \in C$ if and only if $x = \theta_0 v_0 + \theta_1 v_1 + \cdots + \theta_k v_k$ for some $\theta \succeq 0$ with $\mathbf{1}^T\theta = 1$. Equivalently, if we define $y = (\theta_1, \ldots, \theta_k)$ and

$$B = \begin{bmatrix} v_1 - v_0 & \cdots & v_k - v_0 \end{bmatrix} \in \mathbf{R}^{n \times k},$$

we can say that $x \in C$ if and only if

$$x = v_0 + By \qquad (2.8)$$

for some $y \succeq 0$ with $\mathbf{1}^T y \le 1$. Now we note that affine independence of the points $v_0, \ldots, v_k$ implies that the matrix $B$ has rank $k$. Therefore there exists a nonsingular matrix $A = (A_1, A_2) \in \mathbf{R}^{n \times n}$ such that

$$AB = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix} B = \begin{bmatrix} I \\ 0 \end{bmatrix}.$$

Multiplying (2.8) on the left with $A$, we obtain

$$A_1 x = A_1 v_0 + y, \qquad A_2 x = A_2 v_0.$$

From this we see that $x \in C$ if and only if $A_2 x = A_2 v_0$, and the vector $y = A_1 x - A_1 v_0$ satisfies $y \succeq 0$ and $\mathbf{1}^T y \le 1$. In other words we have $x \in C$ if and only if

$$A_2 x = A_2 v_0, \qquad A_1 x \succeq A_1 v_0, \qquad \mathbf{1}^T A_1 x \le 1 + \mathbf{1}^T A_1 v_0,$$

which is a set of linear equalities and inequalities in $x$, and so describes a polyhedron.
Convex hull description of polyhedra

The convex hull of the finite set $\{v_1, \ldots, v_k\}$ is

$$\mathbf{conv}\{v_1, \ldots, v_k\} = \{\theta_1 v_1 + \cdots + \theta_k v_k \mid \theta \succeq 0,\ \mathbf{1}^T\theta = 1\}.$$

This set is a polyhedron, and bounded, but (except in special cases, e.g., a simplex) it is not simple to express it in the form (2.5), i.e., by a set of linear equalities and inequalities.

A generalization of this convex hull description is

$$\{\theta_1 v_1 + \cdots + \theta_k v_k \mid \theta_1 + \cdots + \theta_m = 1,\ \theta_i \ge 0,\ i = 1, \ldots, k\}, \qquad (2.9)$$

where $m \le k$. Here we consider nonnegative linear combinations of $v_i$, but only the first $m$ coefficients are required to sum to one. Alternatively, we can interpret (2.9) as the convex hull of the points $v_1, \ldots, v_m$, plus the conic hull of the points $v_{m+1}, \ldots, v_k$. The set (2.9) defines a polyhedron, and conversely, every polyhedron can be represented in this form (although we will not show this).

The question of how a polyhedron is represented is subtle, and has very important practical consequences. As a simple example consider the unit ball in the $\ell_\infty$-norm in $\mathbf{R}^n$,

$$C = \{x \mid |x_i| \le 1,\ i = 1, \ldots, n\}.$$

The set $C$ can be described in the form (2.5) with $2n$ linear inequalities $\pm e_i^T x \le 1$, where $e_i$ is the $i$th unit vector. To describe it in the convex hull form (2.9) requires at least $2^n$ points:

$$C = \mathbf{conv}\{v_1, \ldots, v_{2^n}\},$$

where $v_1, \ldots, v_{2^n}$ are the $2^n$ vectors all of whose components are $1$ or $-1$. Thus the size of the two descriptions differs greatly, for large $n$.
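The gap between the two descriptions of the $\ell_\infty$ unit ball is easy to see by generating both. In this sketch the function names are illustrative assumptions; `itertools.product` enumerates the sign patterns.

```python
from itertools import product


def linf_ball_inequalities(n):
    """The 2n pairs (a, b) describing the inequalities a^T x <= b, a = +/- e_i."""
    rows = []
    for i in range(n):
        for sign in (1, -1):
            a = [0] * n
            a[i] = sign
            rows.append((a, 1))
    return rows


def linf_ball_vertices(n):
    """The 2^n vectors all of whose components are 1 or -1."""
    return list(product((1, -1), repeat=n))


for n in (2, 3, 10, 16):
    print(n, len(linf_ball_inequalities(n)), len(linf_ball_vertices(n)))
```

For $n = 16$ the inequality form needs 32 rows while the vertex form needs 65536 points, so the choice of representation already matters at modest dimension.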


2.2.5  The  positive  semidefinite  cone 

We use the notation $\mathbf{S}^n$ to denote the set of symmetric $n \times n$ matrices,

$$\mathbf{S}^n = \{X \in \mathbf{R}^{n \times n} \mid X = X^T\},$$

which is a vector space with dimension $n(n + 1)/2$. We use the notation $\mathbf{S}^n_+$ to denote the set of symmetric positive semidefinite matrices:

$$\mathbf{S}^n_+ = \{X \in \mathbf{S}^n \mid X \succeq 0\},$$

and the notation $\mathbf{S}^n_{++}$ to denote the set of symmetric positive definite matrices:

$$\mathbf{S}^n_{++} = \{X \in \mathbf{S}^n \mid X \succ 0\}.$$

(This notation is meant to be analogous to $\mathbf{R}_+$, which denotes the nonnegative reals, and $\mathbf{R}_{++}$, which denotes the positive reals.)


Figure 2.12 Boundary of positive semidefinite cone in $\mathbf{S}^2$.


The set $\mathbf{S}^n_+$ is a convex cone: if $\theta_1, \theta_2 \ge 0$ and $A, B \in \mathbf{S}^n_+$, then $\theta_1 A + \theta_2 B \in \mathbf{S}^n_+$. This can be seen directly from the definition of positive semidefiniteness: for any $x \in \mathbf{R}^n$, we have

$$x^T(\theta_1 A + \theta_2 B)x = \theta_1 x^T Ax + \theta_2 x^T Bx \ge 0,$$

if $A \succeq 0$, $B \succeq 0$ and $\theta_1, \theta_2 \ge 0$.


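Example 2.6 Positive semidefinite cone in $\mathbf{S}^2$. We have

$$X = \begin{bmatrix} x & y \\ y & z \end{bmatrix} \in \mathbf{S}^2_+ \iff x \ge 0,\quad z \ge 0,\quad xz \ge y^2.$$

The boundary of this cone is shown in figure 2.12, plotted in $\mathbf{R}^3$ as $(x, y, z)$.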
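The characterization in Example 2.6 gives a direct test for $2 \times 2$ matrices, and the cone property can be checked on a sample conic combination. The function names and the sample matrices below are assumptions for illustration; a matrix is stored as the triple `(x, y, z)`.

```python
def is_psd_2x2(x, y, z):
    """True if [[x, y], [y, z]] is positive semidefinite (Example 2.6 test)."""
    return x >= 0 and z >= 0 and x * z >= y * y


def quad_form(x, y, z, v):
    """Value of v^T [[x, y], [y, z]] v for a 2-vector v."""
    return x * v[0] * v[0] + 2 * y * v[0] * v[1] + z * v[1] * v[1]


A = (2, 1, 1)   # positive semidefinite: 2 * 1 >= 1
B = (1, -1, 3)  # positive semidefinite: 1 * 3 >= 1
combo = tuple(0.5 * a + 2 * b for a, b in zip(A, B))  # conic combination
print(combo, is_psd_2x2(*combo))

# A matrix that fails the test, with a vector witnessing a negative value.
print(is_psd_2x2(1, 2, 1), quad_form(1, 2, 1, (1, -1)))
```

The failing matrix has nonnegative diagonal entries, so the condition $xz \ge y^2$ is what rules it out; the vector $(1, -1)$ gives a negative quadratic form.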


2.3  Operations  that  preserve  convexity 

In  this  section  we  describe  some  operations  that  preserve  convexity  of  sets,  or 
allow  us  to  construct  convex  sets  from  others.  These  operations,  together  with  the 
simple  examples  described  in  §2.2,  form  a  calculus  of  convex  sets  that  is  useful  for 
determining  or  establishing  convexity  of  sets. 




2.3.1  Intersection 

Convexity is preserved under intersection: if $S_1$ and $S_2$ are convex, then $S_1 \cap S_2$ is convex. This property extends to the intersection of an infinite number of sets: if $S_\alpha$ is convex for every $\alpha \in \mathcal{A}$, then $\bigcap_{\alpha \in \mathcal{A}} S_\alpha$ is convex. (Subspaces, affine sets, and convex cones are also closed under arbitrary intersections.) As a simple example, a polyhedron is the intersection of halfspaces and hyperplanes (which are convex), and therefore is convex.


Example 2.7 The positive semidefinite cone $\mathbf{S}^n_+$ can be expressed as

$$\bigcap_{z \neq 0} \{X \in \mathbf{S}^n \mid z^T Xz \ge 0\}.$$

For each $z \neq 0$, $z^T Xz$ is a (not identically zero) linear function of $X$, so the sets

$$\{X \in \mathbf{S}^n \mid z^T Xz \ge 0\}$$

are, in fact, halfspaces in $\mathbf{S}^n$. Thus the positive semidefinite cone is the intersection of an infinite number of halfspaces, and so is convex.


Example 2.8 We consider the set

$$S = \{x \in \mathbf{R}^m \mid |p(t)| \le 1 \text{ for } |t| \le \pi/3\}, \qquad (2.10)$$

where $p(t) = \sum_{k=1}^m x_k \cos kt$. The set $S$ can be expressed as the intersection of an infinite number of slabs: $S = \bigcap_{|t| \le \pi/3} S_t$, where

$$S_t = \{x \mid -1 \le (\cos t, \ldots, \cos mt)^T x \le 1\},$$

and so is convex. The definition and the set are illustrated in figures 2.13 and 2.14, for $m = 2$.
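The slab description in Example 2.8 suggests a simple numerical check: test the inequality on finitely many values of $t$. Note that this only intersects finitely many slabs, so it is a necessary condition for membership in $S$ and an outer approximation of the set, not an exact test. The function names and the grid size are assumptions.

```python
import math


def p(x, t):
    """Trigonometric polynomial p(t) = sum_k x_k cos(k t), with k = 1, ..., m."""
    return sum(xk * math.cos((k + 1) * t) for k, xk in enumerate(x))


def in_sampled_slabs(x, num_samples=201):
    """Check |p(t)| <= 1 on an evenly spaced grid of t in [-pi/3, pi/3]."""
    width = 2 * math.pi / 3
    ts = [-math.pi / 3 + width * i / (num_samples - 1) for i in range(num_samples)]
    return all(abs(p(x, t)) <= 1 for t in ts)


print(in_sampled_slabs([1, 0]))      # p(t) = cos t
print(in_sampled_slabs([0, 1]))      # p(t) = cos 2t
print(in_sampled_slabs([0.5, 0.5]))  # midpoint of the two points above
print(in_sampled_slabs([1, 1]))      # p(0) = 2 violates the bound
```

The midpoint of two points that pass the test also passes it, as convexity of each slab requires.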


In the examples above we establish convexity of a set by expressing it as a (possibly infinite) intersection of halfspaces. We will see in §2.5.1 that a converse holds: every closed convex set $S$ is a (usually infinite) intersection of halfspaces. In fact, a closed convex set $S$ is the intersection of all halfspaces that contain it:

$$S = \bigcap \{\mathcal{H} \mid \mathcal{H} \text{ halfspace},\ S \subseteq \mathcal{H}\}.$$


2.3.2  Affine  functions 

Recall that a function $f : \mathbf{R}^n \to \mathbf{R}^m$ is affine if it is a sum of a linear function and a constant, i.e., if it has the form $f(x) = Ax + b$, where $A \in \mathbf{R}^{m \times n}$ and $b \in \mathbf{R}^m$. Suppose $S \subseteq \mathbf{R}^n$ is convex and $f : \mathbf{R}^n \to \mathbf{R}^m$ is an affine function. Then the image of $S$ under $f$,

$$f(S) = \{f(x) \mid x \in S\},$$

is convex. Similarly, if $f : \mathbf{R}^k \to \mathbf{R}^n$ is an affine function, the inverse image of $S$ under $f$,

$$f^{-1}(S) = \{x \mid f(x) \in S\},$$

is convex.

Figure 2.13 Three trigonometric polynomials associated with points in the set $S$ defined in (2.10), for $m = 2$. The trigonometric polynomial plotted with dashed line type is the average of the other two.

Figure 2.14 The set $S$ defined in (2.10), for $m = 2$, is shown as the white area in the middle of the plot. The set is the intersection of an infinite number of slabs (20 of which are shown), hence convex.
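The reason affine images of convex sets are convex is that an affine function commutes with convex combinations: $f(\theta x + (1 - \theta)y) = \theta f(x) + (1 - \theta)f(y)$. The sketch below checks this identity on one instance; the data and function names are assumptions, with values chosen so that the floating-point arithmetic is exact.

```python
def affine_map(A, b, x):
    """Return A x + b, with A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]


def combine(u, v, theta):
    """Return theta * u + (1 - theta) * v."""
    return [theta * ui + (1 - theta) * vi for ui, vi in zip(u, v)]


A = [[1, 2], [0, 1], [3, -1]]
b = [1, 0, -2]
x, y = [2, 0], [0, 4]
theta = 0.25

lhs = affine_map(A, b, combine(x, y, theta))  # f of the convex combination
rhs = combine(affine_map(A, b, x), affine_map(A, b, y), theta)  # combination of images
print(lhs, rhs, lhs == rhs)
```

So if $x, y \in S$ and $S$ is convex, the segment between $f(x)$ and $f(y)$ is the image of the segment between $x$ and $y$, and hence lies in $f(S)$.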

Two simple examples are scaling and translation. If $S \subseteq \mathbf{R}^n$ is convex, $\alpha \in \mathbf{R}$, and $a \in \mathbf{R}^n$, then the sets $\alpha S$ and $S + a$ are convex, where

$$\alpha S = \{\alpha x \mid x \in S\}, \qquad S + a = \{x + a \mid x \in S\}.$$

The projection of a convex set onto some of its coordinates is convex: if $S \subseteq \mathbf{R}^m \times \mathbf{R}^n$ is convex, then

$$T = \{x_1 \in \mathbf{R}^m \mid (x_1, x_2) \in S \text{ for some } x_2 \in \mathbf{R}^n\}$$

is convex.

The sum of two sets is defined as

$$S_1 + S_2 = \{x + y \mid x \in S_1,\ y \in S_2\}.$$

If $S_1$ and $S_2$ are convex, then $S_1 + S_2$ is convex. To see this, if $S_1$ and $S_2$ are convex, then so is the direct or Cartesian product

$$S_1 \times S_2 = \{(x_1, x_2) \mid x_1 \in S_1,\ x_2 \in S_2\}.$$

The image of this set under the linear function $f(x_1, x_2) = x_1 + x_2$ is the sum $S_1 + S_2$.

We can also consider the partial sum of $S_1, S_2 \subseteq \mathbf{R}^n \times \mathbf{R}^m$, defined as

$$S = \{(x, y_1 + y_2) \mid (x, y_1) \in S_1,\ (x, y_2) \in S_2\},$$

where $x \in \mathbf{R}^n$ and $y_i \in \mathbf{R}^m$. For $m = 0$, the partial sum gives the intersection of $S_1$ and $S_2$; for $n = 0$, it is set addition. Partial sums of convex sets are convex (see exercise 2.16).


Example 2.9 Polyhedron. The polyhedron $\{x \mid Ax \preceq b,\ Cx = d\}$ can be expressed as the inverse image of the Cartesian product of the nonnegative orthant and the origin under the affine function $f(x) = (b - Ax,\ d - Cx)$:

$$\{x \mid Ax \preceq b,\ Cx = d\} = \{x \mid f(x) \in \mathbf{R}^m_+ \times \{0\}\}.$$


Example 2.10 Solution set of linear matrix inequality. The condition

$$A(x) = x_1 A_1 + \cdots + x_n A_n \preceq B, \qquad (2.11)$$

where $B, A_i \in \mathbf{S}^m$, is called a linear matrix inequality (LMI) in $x$. (Note the similarity to an ordinary linear inequality,

$$a^T x = x_1 a_1 + \cdots + x_n a_n \le b,$$

with $b, a_i \in \mathbf{R}$.)

The solution set of a linear matrix inequality, $\{x \mid A(x) \preceq B\}$, is convex. Indeed, it is the inverse image of the positive semidefinite cone under the affine function $f : \mathbf{R}^n \to \mathbf{S}^m$ given by $f(x) = B - A(x)$.


Example 2.11 Hyperbolic cone. The set

$$\{x \mid x^T Px \le (c^T x)^2,\ c^T x \ge 0\}$$

where $P \in \mathbf{S}^n_+$ and $c \in \mathbf{R}^n$, is convex, since it is the inverse image of the second-order cone,

$$\{(z, t) \mid z^T z \le t^2,\ t \ge 0\},$$

under the affine function $f(x) = (P^{1/2}x,\ c^T x)$.

Example 2.12 Ellipsoid. The ellipsoid

$$\mathcal{E} = \{x \mid (x - x_c)^T P^{-1}(x - x_c) \le 1\},$$

where $P \in \mathbf{S}^n_{++}$, is the image of the unit Euclidean ball $\{u \mid \|u\|_2 \le 1\}$ under the affine mapping $f(u) = P^{1/2}u + x_c$. (It is also the inverse image of the unit ball under the affine mapping $g(x) = P^{-1/2}(x - x_c)$.)


2.3.3  Linear-fractional  and  perspective  functions 

In this section we explore a class of functions, called linear-fractional, that is more general than affine but still preserves convexity.

The perspective function

We define the perspective function $P : \mathbf{R}^{n+1} \to \mathbf{R}^n$, with domain $\mathbf{dom}\, P = \mathbf{R}^n \times \mathbf{R}_{++}$, as $P(z, t) = z/t$. (Here $\mathbf{R}_{++}$ denotes the set of positive numbers: $\mathbf{R}_{++} = \{x \in \mathbf{R} \mid x > 0\}$.) The perspective function scales or normalizes vectors so the last component is one, and then drops the last component.


Remark 2.1 We can interpret the perspective function as the action of a pin-hole camera. A pin-hole camera (in $\mathbf{R}^3$) consists of an opaque horizontal plane $x_3 = 0$, with a single pin-hole at the origin, through which light can pass, and a horizontal image plane $x_3 = -1$. An object at $x$, above the camera (i.e., with $x_3 > 0$), forms an image at the point $-(x_1/x_3,\ x_2/x_3,\ 1)$ on the image plane. Dropping the last component of the image point (since it is always $-1$), the image of a point at $x$ appears at $y = -(x_1/x_3,\ x_2/x_3) = -P(x)$ on the image plane. This is illustrated in figure 2.15.


Figure 2.15 The dark horizontal line represents the plane $x_3 = 0$ in $\mathbf{R}^3$, which is opaque,
except for a pin-hole at the origin. Objects or light sources above the plane
appear on the image plane $x_3 = -1$, which is shown as the lighter horizontal
line. The mapping of the position of a source to the position of its image is
related to the perspective function.

If $C \subseteq \mathbf{dom}\, P$ is convex, then its image

$$P(C) = \{P(x) \mid x \in C\}$$

is convex. This result is certainly intuitive: a convex object, viewed through a
pin-hole camera, yields a convex image. To establish this fact we show that line
segments are mapped to line segments under the perspective function. (This too
makes sense: a line segment, viewed through a pin-hole camera, yields a line segment
image.) Suppose that $x = (\tilde{x}, x_{n+1})$, $y = (\tilde{y}, y_{n+1}) \in \mathbf{R}^{n+1}$ with $x_{n+1} > 0$,
$y_{n+1} > 0$. Then for $0 \le \theta \le 1$,

$$P(\theta x + (1-\theta)y) = \frac{\theta \tilde{x} + (1-\theta)\tilde{y}}{\theta x_{n+1} + (1-\theta) y_{n+1}} = \mu P(x) + (1-\mu) P(y),$$

where

$$\mu = \frac{\theta x_{n+1}}{\theta x_{n+1} + (1-\theta) y_{n+1}} \in [0,1].$$

This correspondence between $\theta$ and $\mu$ is monotonic: as $\theta$ varies between 0 and 1
(which sweeps out the line segment $[x,y]$), $\mu$ varies between 0 and 1 (which sweeps
out the line segment $[P(x),P(y)]$). This shows that $P([x,y]) = [P(x),P(y)]$.

Now suppose $C$ is convex with $C \subseteq \mathbf{dom}\, P$ (i.e., $x_{n+1} > 0$ for all $x \in C$), and
$x, y \in C$. To establish convexity of $P(C)$ we need to show that the line segment
$[P(x),P(y)]$ is in $P(C)$. But this line segment is the image of the line segment
$[x,y]$ under $P$, and so lies in $P(C)$.
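
The segment-to-segment identity used in this argument can be checked numerically. The following is a minimal sketch, not part of the original text: the helper names `perspective`, `combine`, and `mu_from_theta` are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is assumed so that the two sides can be compared with `==`.

```python
from fractions import Fraction

def perspective(v):
    """Perspective function P(z, t) = z / t for v = (z, t) with t > 0."""
    *z, t = v
    if t <= 0:
        raise ValueError("last component must be positive")
    return [zi / t for zi in z]

def combine(theta, x, y):
    """Convex combination theta*x + (1 - theta)*y of two vectors."""
    return [theta * xi + (1 - theta) * yi for xi, yi in zip(x, y)]

def mu_from_theta(theta, x, y):
    """Weight mu with P(theta*x + (1-theta)*y) = mu*P(x) + (1-mu)*P(y)."""
    return theta * x[-1] / (theta * x[-1] + (1 - theta) * y[-1])

# Two points in R^3 with positive last component (illustrative values).
x = [Fraction(1), Fraction(2), Fraction(4)]
y = [Fraction(3), Fraction(-1), Fraction(1)]
theta = Fraction(1, 3)

mu = mu_from_theta(theta, x, y)
lhs = perspective(combine(theta, x, y))               # image of a point on [x, y]
rhs = combine(mu, perspective(x), perspective(y))     # point on [P(x), P(y)]
```

For these values `mu` differs from `theta`, which illustrates that the perspective function maps segments to segments without preserving the parametrization.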

The inverse image of a convex set under the perspective function is also convex:
if $C \subseteq \mathbf{R}^n$ is convex, then

$$P^{-1}(C) = \{(x,t) \in \mathbf{R}^{n+1} \mid x/t \in C,\ t > 0\}$$

is convex. To show this, suppose $(x,t) \in P^{-1}(C)$, $(y,s) \in P^{-1}(C)$, and $0 \le \theta \le 1$.
We need to show that

$$\theta(x,t) + (1-\theta)(y,s) \in P^{-1}(C),$$

i.e., that

$$\frac{\theta x + (1-\theta)y}{\theta t + (1-\theta)s} \in C$$

($\theta t + (1-\theta)s > 0$ is obvious). This follows from

$$\frac{\theta x + (1-\theta)y}{\theta t + (1-\theta)s} = \mu (x/t) + (1-\mu)(y/s),$$

where

$$\mu = \frac{\theta t}{\theta t + (1-\theta)s} \in [0,1].$$


Linear-fractional  functions 

A linear-fractional function is formed by composing the perspective function with
an affine function. Suppose $g : \mathbf{R}^n \to \mathbf{R}^{m+1}$ is affine, i.e.,

$$g(x) = \begin{bmatrix} A \\ c^T \end{bmatrix} x + \begin{bmatrix} b \\ d \end{bmatrix}, \tag{2.12}$$

where $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$, $c \in \mathbf{R}^n$, and $d \in \mathbf{R}$. The function $f : \mathbf{R}^n \to \mathbf{R}^m$ given
by $f = P \circ g$, i.e.,

$$f(x) = (Ax + b)/(c^T x + d), \qquad \mathbf{dom}\, f = \{x \mid c^T x + d > 0\}, \tag{2.13}$$

is called a linear-fractional (or projective) function. If $c = 0$ and $d > 0$, the domain
of $f$ is $\mathbf{R}^n$, and $f$ is an affine function. So we can think of affine and linear functions
as special cases of linear-fractional functions.


Remark 2.2 Projective interpretation. It is often convenient to represent a linear-fractional
function as a matrix

$$Q = \begin{bmatrix} A & b \\ c^T & d \end{bmatrix} \in \mathbf{R}^{(m+1) \times (n+1)} \tag{2.14}$$

that acts on (multiplies) points of form $(x, 1)$, which yields $(Ax + b,\ c^T x + d)$. This
result is then scaled or normalized so that its last component is one, which yields
$(f(x), 1)$.

This representation can be interpreted geometrically by associating $\mathbf{R}^n$ with a set
of rays in $\mathbf{R}^{n+1}$ as follows. With each point $z$ in $\mathbf{R}^n$ we associate the (open) ray
$\mathcal{P}(z) = \{t(z,1) \mid t > 0\}$ in $\mathbf{R}^{n+1}$. The last component of this ray takes on positive
values. Conversely any ray in $\mathbf{R}^{n+1}$, with base at the origin and last component
which takes on positive values, can be written as $\mathcal{P}(v) = \{t(v,1) \mid t > 0\}$ for some
$v \in \mathbf{R}^n$. This (projective) correspondence $\mathcal{P}$ between $\mathbf{R}^n$ and the halfspace of rays
with positive last component is one-to-one and onto.

The linear-fractional function (2.13) can be expressed as

$$f(x) = \mathcal{P}^{-1}(Q\,\mathcal{P}(x)).$$

Thus, we start with $x \in \mathbf{dom}\, f$, i.e., $c^T x + d > 0$. We then form the ray $\mathcal{P}(x)$ in
$\mathbf{R}^{n+1}$. The linear transformation with matrix $Q$ acts on this ray to produce another
ray $Q\,\mathcal{P}(x)$. Since $x \in \mathbf{dom}\, f$, the last component of this ray assumes positive values.
Finally we take the inverse projective transformation to recover $f(x)$.

Figure 2.16 Left. A set $C \subseteq \mathbf{R}^2$. The dashed line shows the boundary of
the domain of the linear-fractional function $f(x) = x/(x_1 + x_2 + 1)$ with
$\mathbf{dom}\, f = \{(x_1, x_2) \mid x_1 + x_2 + 1 > 0\}$. Right. Image of $C$ under $f$. The
dashed line shows the boundary of the domain of $f^{-1}$.
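
The definition (2.13) and the matrix representation (2.14) can be compared on a small example. This is an illustrative sketch under stated assumptions: the function names `linear_fractional` and `via_projective_matrix` are hypothetical, matrices are plain lists of rows, and the data are those of the function $f(x) = x/(x_1 + x_2 + 1)$ from figure 2.16.

```python
from fractions import Fraction

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def linear_fractional(A, b, c, d, x):
    """f(x) = (A x + b) / (c^T x + d), defined only where c^T x + d > 0."""
    denom = sum(ci * xi for ci, xi in zip(c, x)) + d
    if denom <= 0:
        raise ValueError("x is outside dom f")
    return [(axi + bi) / denom for axi, bi in zip(matvec(A, x), b)]

def via_projective_matrix(A, b, c, d, x):
    """Same map: Q = [[A, b], [c^T, d]] acts on (x, 1), then normalize."""
    Q = [list(row) + [bi] for row, bi in zip(A, b)] + [list(c) + [d]]
    *top, last = matvec(Q, list(x) + [1])
    if last <= 0:
        raise ValueError("x is outside dom f")
    return [ti / last for ti in top]

# Data for f(x) = x / (x1 + x2 + 1): A = I, b = 0, c = (1, 1), d = 1.
A = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]]
b = [Fraction(0), Fraction(0)]
c = [Fraction(1), Fraction(1)]
d = Fraction(1)
x = [Fraction(1), Fraction(2)]

direct = linear_fractional(A, b, c, d, x)
projective = via_projective_matrix(A, b, c, d, x)
```

Both routes agree on points of the domain, and both reject points with $c^T x + d \le 0$, where the ray $Q\,\mathcal{P}(x)$ no longer has positive last component.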


Like the perspective function, linear-fractional functions preserve convexity. If
$C$ is convex and lies in the domain of $f$ (i.e., $c^T x + d > 0$ for $x \in C$), then its
image $f(C)$ is convex. This follows immediately from results above: the image of $C$
under the affine mapping (2.12) is convex, and the image of the resulting set under
the perspective function $P$, which yields $f(C)$, is convex. Similarly, if $C \subseteq \mathbf{R}^m$ is
convex, then the inverse image $f^{-1}(C)$ is convex.


Example 2.13 Conditional probabilities. Suppose $u$ and $v$ are random variables
that take on values in $\{1,\ldots,n\}$ and $\{1,\ldots,m\}$, respectively, and let $p_{ij}$ denote
$\mathbf{prob}(u = i,\ v = j)$. Then the conditional probability $f_{ij} = \mathbf{prob}(u = i \mid v = j)$ is
given by

$$f_{ij} = \frac{p_{ij}}{\sum_{k=1}^n p_{kj}}.$$

Thus $f$ is obtained by a linear-fractional mapping from $p$.

It follows that if $C$ is a convex set of joint probabilities for $(u, v)$, then the associated
set of conditional probabilities of $u$ given $v$ is also convex.
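
The mapping from joint to conditional probabilities can be sketched as follows. The function name `conditional_given_v` and the sample joint distribution are assumptions made for illustration; the joint probabilities are stored as a list of rows indexed by the value of $u$.

```python
from fractions import Fraction

def conditional_given_v(p):
    """Return f with f[i][j] = p[i][j] / sum_k p[k][j] = prob(u = i | v = j)."""
    n, m = len(p), len(p[0])
    col_sums = [sum(p[k][j] for k in range(n)) for j in range(m)]
    if any(s <= 0 for s in col_sums):
        # The denominator must be positive, as in the domain of (2.13).
        raise ValueError("each value of v needs positive probability")
    return [[p[i][j] / col_sums[j] for j in range(m)] for i in range(n)]

# Illustrative joint distribution with n = 2 values of u and m = 2 values of v.
p = [[Fraction(1, 8), Fraction(1, 4)],
     [Fraction(3, 8), Fraction(1, 4)]]
f = conditional_given_v(p)
```

Each column of `f` is the corresponding column of `p` divided by a linear function of `p` (the column sum), which is the linear-fractional structure the example refers to.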


Figure 2.16 shows a set $C \subseteq \mathbf{R}^2$, and its image under the linear-fractional
function

$$f(x) = \frac{1}{x_1 + x_2 + 1}\, x, \qquad \mathbf{dom}\, f = \{(x_1, x_2) \mid x_1 + x_2 + 1 > 0\}.$$


2.4  Generalized  inequalities

2.4.1  Proper  cones  and  generalized  inequalities 

A cone $K \subseteq \mathbf{R}^n$ is called a proper cone if it satisfies the following:

•  $K$ is convex.

•  $K$ is closed.

•  $K$ is solid, which means it has nonempty interior.

•  $K$ is pointed, which means that it contains no line (or equivalently, $x \in K,\ -x \in K \implies x = 0$).

A proper cone $K$ can be used to define a generalized inequality, which is a partial
ordering on $\mathbf{R}^n$ that has many of the properties of the standard ordering on $\mathbf{R}$.
We associate with the proper cone $K$ the partial ordering on $\mathbf{R}^n$ defined by

$$x \preceq_K y \iff y - x \in K.$$

We also write $x \succeq_K y$ for $y \preceq_K x$. Similarly, we define an associated strict partial
ordering by

$$x \prec_K y \iff y - x \in \mathbf{int}\, K,$$

and write $x \succ_K y$ for $y \prec_K x$. (To distinguish the generalized inequality $\preceq_K$
from the strict generalized inequality, we sometimes refer to $\preceq_K$ as the nonstrict
generalized inequality.)

When $K = \mathbf{R}_+$, the partial ordering $\preceq_K$ is the usual ordering $\le$ on $\mathbf{R}$, and
the strict partial ordering $\prec_K$ is the same as the usual strict ordering $<$ on $\mathbf{R}$.
So generalized inequalities include as a special case ordinary (nonstrict and strict)
inequality in $\mathbf{R}$.


Example 2.14 Nonnegative orthant and componentwise inequality. The nonnegative
orthant $K = \mathbf{R}^n_+$ is a proper cone. The associated generalized inequality $\preceq_K$ corresponds
to componentwise inequality between vectors: $x \preceq_K y$ means that $x_i \le y_i$,
$i = 1,\ldots,n$. The associated strict inequality corresponds to componentwise strict
inequality: $x \prec_K y$ means that $x_i < y_i$, $i = 1,\ldots,n$.

The nonstrict and strict partial orderings associated with the nonnegative orthant
arise so frequently that we drop the subscript $\mathbf{R}^n_+$; it is understood when the symbol
$\preceq$ or $\prec$ appears between vectors.
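
A short sketch of the componentwise ordering, with hypothetical helper names `leq_orthant` and `lt_orthant`, makes the definitions concrete. It also shows a pair of vectors for which neither $x \preceq y$ nor $y \preceq x$ holds, which is possible because $\preceq$ is only a partial ordering.

```python
def leq_orthant(x, y):
    """x <= y in the orthant ordering: y - x has no negative entry."""
    return all(yi - xi >= 0 for xi, yi in zip(x, y))

def lt_orthant(x, y):
    """Strict version: y - x lies in the interior of the orthant."""
    return all(yi - xi > 0 for xi, yi in zip(x, y))

x, y = (1, 2), (1, 3)        # y - x = (0, 1): in K, but not in int K
u, v = (1, 0), (0, 1)        # v - u = (-1, 1): neither v - u nor u - v is in K

nonstrict_holds = leq_orthant(x, y)
strict_holds = lt_orthant(x, y)
incomparable = not leq_orthant(u, v) and not leq_orthant(v, u)
```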


Example 2.15 Positive semidefinite cone and matrix inequality. The positive semidefinite
cone $\mathbf{S}^n_+$ is a proper cone in $\mathbf{S}^n$. The associated generalized inequality $\preceq_K$ is the
usual matrix inequality: $X \preceq_K Y$ means $Y - X$ is positive semidefinite. The interior
of $\mathbf{S}^n_+$ (in $\mathbf{S}^n$) consists of the positive definite matrices, so the strict generalized
inequality also agrees with the usual strict inequality between symmetric matrices:
$X \prec_K Y$ means $Y - X$ is positive definite.

Here, too, the partial ordering arises so frequently that we drop the subscript: for
symmetric matrices we write simply $X \preceq Y$ or $X \prec Y$. It is understood that the
generalized inequalities are with respect to the positive semidefinite cone.

Example 2.16 Cone of polynomials nonnegative on $[0,1]$. Let $K$ be defined as

$$K = \{c \in \mathbf{R}^n \mid c_1 + c_2 t + \cdots + c_n t^{n-1} \ge 0 \text{ for } t \in [0,1]\}, \tag{2.15}$$

i.e., $K$ is the cone of (coefficients of) polynomials of degree $n-1$ that are nonnegative
on the interval $[0,1]$. It can be shown that $K$ is a proper cone; its interior is the set
of coefficients of polynomials that are positive on the interval $[0,1]$.

Two vectors $c, d \in \mathbf{R}^n$ satisfy $c \preceq_K d$ if and only if

$$c_1 + c_2 t + \cdots + c_n t^{n-1} \le d_1 + d_2 t + \cdots + d_n t^{n-1}$$

for all $t \in [0,1]$.

Properties of generalized inequalities

A generalized inequality $\preceq_K$ satisfies many properties, such as

•  $\preceq_K$ is preserved under addition: if $x \preceq_K y$ and $u \preceq_K v$, then $x + u \preceq_K y + v$.

•  $\preceq_K$ is transitive: if $x \preceq_K y$ and $y \preceq_K z$ then $x \preceq_K z$.

•  $\preceq_K$ is preserved under nonnegative scaling: if $x \preceq_K y$ and $\alpha \ge 0$ then $\alpha x \preceq_K \alpha y$.

•  $\preceq_K$ is reflexive: $x \preceq_K x$.

•  $\preceq_K$ is antisymmetric: if $x \preceq_K y$ and $y \preceq_K x$, then $x = y$.

•  $\preceq_K$ is preserved under limits: if $x_i \preceq_K y_i$ for $i = 1, 2, \ldots$, $x_i \to x$ and $y_i \to y$
as $i \to \infty$, then $x \preceq_K y$.

The corresponding strict generalized inequality $\prec_K$ satisfies, for example,

•  if $x \prec_K y$ then $x \preceq_K y$.

•  if $x \prec_K y$ and $u \preceq_K v$ then $x + u \prec_K y + v$.

•  if $x \prec_K y$ and $\alpha > 0$ then $\alpha x \prec_K \alpha y$.

•  $x \not\prec_K x$.

•  if $x \prec_K y$, then for $u$ and $v$ small enough, $x + u \prec_K y + v$.

These properties are inherited from the definitions of $\preceq_K$ and $\prec_K$, and the properties
of proper cones; see exercise 2.30.


2.4.2  Minimum  and  minimal  elements

The notation of generalized inequality (i.e., $\preceq_K$, $\prec_K$) is meant to suggest the
analogy to ordinary inequality on $\mathbf{R}$ (i.e., $\le$, $<$). While many properties of ordinary
inequality do hold for generalized inequalities, some important ones do not. The
most obvious difference is that $\le$ on $\mathbf{R}$ is a linear ordering: any two points are
comparable, meaning either $x \le y$ or $y \le x$. This property does not hold for
other generalized inequalities. One implication is that concepts like minimum and
maximum are more complicated in the context of generalized inequalities. We
briefly discuss this in this section.

We say that $x \in S$ is the minimum element of $S$ (with respect to the generalized
inequality $\preceq_K$) if for every $y \in S$ we have $x \preceq_K y$. We define the maximum
element of a set $S$, with respect to a generalized inequality, in a similar way. If a
set has a minimum (maximum) element, then it is unique. A related concept is
minimal element. We say that $x \in S$ is a minimal element of $S$ (with respect to
the generalized inequality $\preceq_K$) if $y \in S$, $y \preceq_K x$ only if $y = x$. We define maximal
element in a similar way. A set can have many different minimal (maximal)
elements.

We can describe minimum and minimal elements using simple set notation. A
point $x \in S$ is the minimum element of $S$ if and only if

$$S \subseteq x + K.$$

Here $x + K$ denotes all the points that are comparable to $x$ and greater than or
equal to $x$ (according to $\preceq_K$). A point $x \in S$ is a minimal element if and only if

$$(x - K) \cap S = \{x\}.$$

Here $x - K$ denotes all the points that are comparable to $x$ and less than or equal
to $x$ (according to $\preceq_K$); the only point in common with $S$ is $x$.

For $K = \mathbf{R}_+$, which induces the usual ordering on $\mathbf{R}$, the concepts of minimal
and minimum are the same, and agree with the usual definition of the minimum
element of a set.


Example 2.17 Consider the cone $\mathbf{R}^2_+$, which induces componentwise inequality in
$\mathbf{R}^2$. Here we can give some simple geometric descriptions of minimal and minimum
elements. The inequality $x \preceq y$ means $y$ is above and to the right of $x$. To say that
$x \in S$ is the minimum element of a set $S$ means that all other points of $S$ lie above
and to the right. To say that $x$ is a minimal element of a set $S$ means that no other
point of $S$ lies to the left and below $x$. This is illustrated in figure 2.17.
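
For a finite set of points in $\mathbf{R}^2$ the two definitions can be evaluated directly. The sketch below is an illustration only: the function names and the sample sets `S1` and `S2` are hypothetical and are not the sets drawn in figure 2.17.

```python
def leq(x, y):
    """Componentwise inequality x <= y (the ordering induced by R^2_+)."""
    return all(xi <= yi for xi, yi in zip(x, y))

def minimum_element(S):
    """Return the x in S with x <= y for every y in S, or None if none exists."""
    for x in S:
        if all(leq(x, y) for y in S):
            return x
    return None

def minimal_elements(S):
    """Return every x in S such that y in S, y <= x holds only for y == x."""
    return [x for x in S if all(y == x or not leq(y, x) for y in S)]

S1 = [(1, 1), (2, 1), (1, 3), (2, 2)]   # (1, 1) lies below and left of all others
S2 = [(1, 3), (2, 2), (3, 1), (3, 3)]   # no single point dominates the rest

min_S1 = minimum_element(S1)
min_S2 = minimum_element(S2)
minimal_S2 = minimal_elements(S2)
```

The set `S2` has no minimum element but three minimal elements, while the minimum element of `S1` is also its only minimal element.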


Figure 2.17 Left. The set $S_1$ has a minimum element $x_1$ with respect to
componentwise inequality in $\mathbf{R}^2$. The set $x_1 + K$ is shaded lightly; $x_1$ is
the minimum element of $S_1$ since $S_1 \subseteq x_1 + K$. Right. The point $x_2$ is a
minimal point of $S_2$. The set $x_2 - K$ is shown lightly shaded. The point $x_2$
is minimal because $x_2 - K$ and $S_2$ intersect only at $x_2$.

Example 2.18 Minimum and minimal elements of a set of symmetric matrices. We
associate with each $A \in \mathbf{S}^n_{++}$ an ellipsoid centered at the origin, given by

$$\mathcal{E}_A = \{x \mid x^T A^{-1} x \le 1\}.$$

We have $A \preceq B$ if and only if $\mathcal{E}_A \subseteq \mathcal{E}_B$.
Let $v_1,\ldots,v_k \in \mathbf{R}^n$ be given and define

$$S = \{P \in \mathbf{S}^n_{++} \mid v_i^T P^{-1} v_i \le 1,\ i = 1,\ldots,k\},$$

which corresponds to the set of ellipsoids that contain the points $v_1,\ldots,v_k$. The
set $S$ does not have a minimum element: for any ellipsoid that contains the points
$v_1,\ldots,v_k$ we can find another one that contains the points, and is not comparable
to it. An ellipsoid is minimal if it contains the points, but no smaller ellipsoid does.
Figure 2.18 shows an example in $\mathbf{R}^2$ with $k = 2$.


2.5  Separating  and  supporting  hyperplanes 

2.5.1  Separating  hyperplane  theorem 

In this section we describe an idea that will be important later: the use of hyperplanes
or affine functions to separate convex sets that do not intersect. The basic
result is the separating hyperplane theorem: Suppose $C$ and $D$ are nonempty disjoint
convex sets, i.e., $C \cap D = \emptyset$. Then there exist $a \ne 0$ and $b$ such that $a^T x \le b$
for all $x \in C$ and $a^T x \ge b$ for all $x \in D$. In other words, the affine function $a^T x - b$
is nonpositive on $C$ and nonnegative on $D$. The hyperplane $\{x \mid a^T x = b\}$ is called
a separating hyperplane for the sets $C$ and $D$, or is said to separate the sets $C$ and
$D$. This is illustrated in figure 2.19.

Figure 2.18 Three ellipsoids in $\mathbf{R}^2$, centered at the origin (shown as the
lower dot), that contain the points shown as the upper dots. The ellipsoid
$\mathcal{E}_1$ is not minimal, since there exist ellipsoids that contain the points, and
are smaller (e.g., $\mathcal{E}_3$). $\mathcal{E}_3$ is not minimal for the same reason. The ellipsoid
$\mathcal{E}_2$ is minimal, since no other ellipsoid (centered at the origin) contains the
points and is contained in $\mathcal{E}_2$.

Figure 2.19 The hyperplane $\{x \mid a^T x = b\}$ separates the disjoint convex sets
$C$ and $D$. The affine function $a^T x - b$ is nonpositive on $C$ and nonnegative
on $D$.

Proof of separating hyperplane theorem

Here we consider a special case, and leave the extension of the proof to the general
case as an exercise (exercise 2.22). We assume that the (Euclidean) distance
between $C$ and $D$, defined as

$$\mathbf{dist}(C, D) = \inf\{\|u - v\|_2 \mid u \in C,\ v \in D\},$$

is positive, and that there exist points $c \in C$ and $d \in D$ that achieve the minimum
distance, i.e., $\|c - d\|_2 = \mathbf{dist}(C, D)$. (These conditions are satisfied, for example,
when $C$ and $D$ are closed and one set is bounded.)

Define

$$a = d - c, \qquad b = \frac{\|d\|_2^2 - \|c\|_2^2}{2}.$$

We will show that the affine function

$$f(x) = a^T x - b = (d - c)^T (x - (1/2)(d + c))$$

is nonpositive on $C$ and nonnegative on $D$, i.e., that the hyperplane $\{x \mid a^T x = b\}$
separates $C$ and $D$. This hyperplane is perpendicular to the line segment between
$c$ and $d$, and passes through its midpoint, as shown in figure 2.20.

Figure 2.20 Construction of a separating hyperplane between two convex
sets. The points $c \in C$ and $d \in D$ are the pair of points in the two sets that
are closest to each other. The separating hyperplane is orthogonal to, and
bisects, the line segment between $c$ and $d$.

We first show that $f$ is nonnegative on $D$. The proof that $f$ is nonpositive on
$C$ is similar (or follows by swapping $C$ and $D$ and considering $-f$). Suppose there
were a point $u \in D$ for which

$$f(u) = (d - c)^T (u - (1/2)(d + c)) < 0. \tag{2.16}$$

We can express $f(u)$ as

$$f(u) = (d - c)^T (u - d + (1/2)(d - c)) = (d - c)^T (u - d) + (1/2)\|d - c\|_2^2.$$

We see that (2.16) implies $(d - c)^T (u - d) < 0$. Now we observe that

$$\left.\frac{d}{dt} \|d + t(u - d) - c\|_2^2 \right|_{t=0} = 2(d - c)^T (u - d) < 0,$$

so for some small $t > 0$, with $t \le 1$, we have

$$\|d + t(u - d) - c\|_2 < \|d - c\|_2,$$

i.e., the point $d + t(u - d)$ is closer to $c$ than $d$ is. Since $D$ is convex and contains
$d$ and $u$, we have $d + t(u - d) \in D$. But this is impossible, since $d$ is assumed to be
the point in $D$ that is closest to $C$.
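
The construction in the proof can be tried on two discs in $\mathbf{R}^2$, for which the closest pair of points is known in closed form. The following sketch makes several assumptions for illustration: $C$ is the unit disc at the origin, $D$ is the disc of radius 2 centered at $(4,3)$, the closest points $c = (0.8, 0.6)$ and $d = (2.4, 1.8)$ are supplied by hand rather than computed, and the sign conditions are checked only on sampled points.

```python
import math

def separating_hyperplane(c, d):
    """Given closest points c in C and d in D, return (a, b) with
    a = d - c and b = (||d||^2 - ||c||^2) / 2, as in the proof."""
    a = [di - ci for ci, di in zip(c, d)]
    b = (sum(di * di for di in d) - sum(ci * ci for ci in c)) / 2
    return a, b

def f(a, b, x):
    """Affine function a^T x - b."""
    return sum(ai * xi for ai, xi in zip(a, x)) - b

def disc_samples(center, radius, k=60):
    """The center plus k points on the boundary of a disc in R^2."""
    pts = [center]
    for i in range(k):
        ang = 2 * math.pi * i / k
        pts.append((center[0] + radius * math.cos(ang),
                    center[1] + radius * math.sin(ang)))
    return pts

c, d = (0.8, 0.6), (2.4, 1.8)            # closest pair, assumed known
a, b = separating_hyperplane(c, d)
C_pts = disc_samples((0.0, 0.0), 1.0)
D_pts = disc_samples((4.0, 3.0), 2.0)
midpoint = [(ci + di) / 2 for ci, di in zip(c, d)]
```

Here $f$ vanishes at the midpoint of the segment between $c$ and $d$, is nonpositive on the samples of $C$, and is nonnegative on the samples of $D$.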


Example 2.19 Separation of an affine and a convex set. Suppose $C$ is convex and
$D$ is affine, i.e., $D = \{Fu + g \mid u \in \mathbf{R}^m\}$, where $F \in \mathbf{R}^{n \times m}$. Suppose $C$ and $D$ are
disjoint, so by the separating hyperplane theorem there are $a \ne 0$ and $b$ such that
$a^T x \le b$ for all $x \in C$ and $a^T x \ge b$ for all $x \in D$.

Now $a^T x \ge b$ for all $x \in D$ means $a^T F u \ge b - a^T g$ for all $u \in \mathbf{R}^m$. But a linear
function is bounded below on $\mathbf{R}^m$ only when it is zero, so we conclude $a^T F = 0$ (and
hence, $b \le a^T g$).

Thus we conclude that there exists $a \ne 0$ such that $F^T a = 0$ and $a^T x \le a^T g$ for all
$x \in C$.


Strict  separation 

The separating hyperplane we constructed above satisfies the stronger condition
that $a^T x < b$ for all $x \in C$ and $a^T x > b$ for all $x \in D$. This is called strict
separation of the sets $C$ and $D$. Simple examples show that in general, disjoint
convex sets need not be strictly separable by a hyperplane (even when the sets are
closed; see exercise 2.23). In many special cases, however, strict separation can be
established.


Example 2.20 Strict separation of a point and a closed convex set. Let $C$ be a closed
convex set and $x_0 \notin C$. Then there exists a hyperplane that strictly separates $x_0$
from $C$.

To see this, note that the two sets $C$ and $B(x_0, \epsilon)$ do not intersect for some $\epsilon > 0$.
By the separating hyperplane theorem, there exist $a \ne 0$ and $b$ such that $a^T x \le b$ for
$x \in C$ and $a^T x \ge b$ for $x \in B(x_0, \epsilon)$.

Using $B(x_0, \epsilon) = \{x_0 + u \mid \|u\|_2 \le \epsilon\}$, the second condition can be expressed as

$$a^T (x_0 + u) \ge b \text{ for all } \|u\|_2 \le \epsilon.$$

The $u$ that minimizes the lefthand side is $u = -\epsilon a/\|a\|_2$; using this value we have

$$a^T x_0 - \epsilon \|a\|_2 \ge b.$$

Therefore the affine function

$$f(x) = a^T x - b - \epsilon \|a\|_2 / 2$$

is negative on $C$ and positive at $x_0$.

As an immediate consequence we can establish a fact that we already mentioned
above: a closed convex set is the intersection of all halfspaces that contain it. Indeed,
let $C$ be closed and convex, and let $S$ be the intersection of all halfspaces containing
$C$. Obviously $x \in C \implies x \in S$. To show the converse, suppose there exists $x \in S$,
$x \notin C$. By the strict separation result there exists a hyperplane that strictly separates
$x$ from $C$, i.e., there is a halfspace containing $C$ but not $x$. In other words, $x \notin S$.

Converse  separating  hyperplane  theorems

The converse of the separating hyperplane theorem (i.e., existence of a separating
hyperplane implies that $C$ and $D$ do not intersect) is not true, unless one imposes
additional constraints on $C$ or $D$, even beyond convexity. As a simple counterexample,
consider $C = D = \{0\} \subseteq \mathbf{R}$. Here the hyperplane $x = 0$ separates $C$ and
$D$.

By adding conditions on $C$ and $D$ various converse separation theorems can be
derived. As a very simple example, suppose $C$ and $D$ are convex sets, with $C$ open,
and there exists an affine function $f$ that is nonpositive on $C$ and nonnegative on
$D$. Then $C$ and $D$ are disjoint. (To see this we first note that $f$ must be negative
on $C$; for if $f$ were zero at a point of $C$ then $f$ would take on positive values near
the point, which is a contradiction. But then $C$ and $D$ must be disjoint since $f$
is negative on $C$ and nonnegative on $D$.) Putting this converse together with the
separating hyperplane theorem, we have the following result: any two convex sets
$C$ and $D$, at least one of which is open, are disjoint if and only if there exists a
separating hyperplane.


Example 2.21 Theorem of alternatives for strict linear inequalities. We derive the
necessary and sufficient conditions for solvability of a system of strict linear inequalities

$$Ax \prec b. \tag{2.17}$$

These inequalities are infeasible if and only if the (convex) sets

$$C = \{b - Ax \mid x \in \mathbf{R}^n\}, \qquad D = \mathbf{R}^m_{++} = \{y \in \mathbf{R}^m \mid y \succ 0\}$$

do not intersect. The set $D$ is open; $C$ is an affine set. Hence by the result above, $C$
and $D$ are disjoint if and only if there exists a separating hyperplane, i.e., a nonzero
$\lambda \in \mathbf{R}^m$ and $\mu \in \mathbf{R}$ such that $\lambda^T y \le \mu$ on $C$ and $\lambda^T y \ge \mu$ on $D$.

Each of these conditions can be simplified. The first means $\lambda^T (b - Ax) \le \mu$ for all $x$.
This implies (as in example 2.19) that $A^T \lambda = 0$ and $\lambda^T b \le \mu$. The second inequality
means $\lambda^T y \ge \mu$ for all $y \succ 0$. This implies $\mu \le 0$ and $\lambda \succeq 0$, $\lambda \ne 0$.

Putting it all together, we find that the set of strict inequalities (2.17) is infeasible if
and only if there exists $\lambda \in \mathbf{R}^m$ such that

$$\lambda \ne 0, \quad \lambda \succeq 0, \quad A^T \lambda = 0, \quad \lambda^T b \le 0. \tag{2.18}$$

This is also a system of linear inequalities and linear equations in the variable $\lambda \in \mathbf{R}^m$.
We say that (2.17) and (2.18) form a pair of alternatives: for any data $A$ and $b$, exactly
one of them is solvable.
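
A tiny instance shows how a vector $\lambda$ satisfying (2.18) certifies infeasibility of (2.17). The sketch below only verifies a given certificate or a given feasible point; it does not search for either. The function names and the data are assumptions chosen for illustration: the system $x < 0$, $-x < 0$ is infeasible, while $x < 1$, $-x < 0$ is feasible.

```python
from fractions import Fraction

def is_certificate(A, b, lam):
    """Check (2.18): lam != 0, lam >= 0, A^T lam = 0, lam^T b <= 0."""
    m, n = len(A), len(A[0])
    nonzero = any(l != 0 for l in lam)
    nonneg = all(l >= 0 for l in lam)
    At_lam = [sum(A[i][j] * lam[i] for i in range(m)) for j in range(n)]
    lam_b = sum(l * bi for l, bi in zip(lam, b))
    return nonzero and nonneg and all(v == 0 for v in At_lam) and lam_b <= 0

def strictly_feasible(A, b, x):
    """Check (2.17): every component of A x is strictly less than b."""
    return all(sum(aij * xj for aij, xj in zip(row, x)) < bi
               for row, bi in zip(A, b))

A = [[1], [-1]]
lam = [1, 1]

b_infeasible = [0, 0]     # x < 0 and -x < 0 cannot both hold
b_feasible = [1, 0]       # x < 1 and -x < 0 holds, e.g., at x = 1/2

certified = is_certificate(A, b_infeasible, lam)
feasible = strictly_feasible(A, b_feasible, [Fraction(1, 2)])
still_certificate = is_certificate(A, b_feasible, lam)
```

For the feasible data the same $\lambda$ fails the last condition of (2.18), consistent with the two systems being alternatives.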


2.5.2  Supporting  hyperplanes 

Suppose $C \subseteq \mathbf{R}^n$, and $x_0$ is a point in its boundary $\mathbf{bd}\, C$, i.e.,

$$x_0 \in \mathbf{bd}\, C = \mathbf{cl}\, C \setminus \mathbf{int}\, C.$$

If $a \ne 0$ satisfies $a^T x \le a^T x_0$ for all $x \in C$, then the hyperplane $\{x \mid a^T x = a^T x_0\}$
is called a supporting hyperplane to $C$ at the point $x_0$. This is equivalent to saying
that the point $x_0$ and the set $C$ are separated by the hyperplane $\{x \mid a^T x = a^T x_0\}$.
The geometric interpretation is that the hyperplane $\{x \mid a^T x = a^T x_0\}$ is tangent
to $C$ at $x_0$, and the halfspace $\{x \mid a^T x \le a^T x_0\}$ contains $C$. This is illustrated in
figure 2.21.

Figure 2.21 The hyperplane $\{x \mid a^T x = a^T x_0\}$ supports $C$ at $x_0$.

A basic result, called the supporting hyperplane theorem, states that for any
nonempty convex set $C$, and any $x_0 \in \mathbf{bd}\, C$, there exists a supporting hyperplane to
$C$ at $x_0$. The supporting hyperplane theorem is readily proved from the separating
hyperplane theorem. We distinguish two cases. If the interior of $C$ is nonempty,
the result follows immediately by applying the separating hyperplane theorem to
the sets $\{x_0\}$ and $\mathbf{int}\, C$. If the interior of $C$ is empty, then $C$ must lie in an affine
set of dimension less than $n$, and any hyperplane containing that affine set contains
$C$ and $x_0$, and is a (trivial) supporting hyperplane.

There is also a partial converse of the supporting hyperplane theorem: If a set
is closed, has nonempty interior, and has a supporting hyperplane at every point
in its boundary, then it is convex. (See exercise 2.27.)


2.6  Dual  cones  and  generalized  inequalities 

2.6.1  Dual  cones 

Let $K$ be a cone. The set

$$K^* = \{y \mid x^T y \ge 0 \text{ for all } x \in K\} \tag{2.19}$$

is called the dual cone of $K$. As the name suggests, $K^*$ is a cone, and is always
convex, even when the original cone $K$ is not (see exercise 2.31).

Geometrically, $y \in K^*$ if and only if $-y$ is the normal of a hyperplane that
supports $K$ at the origin. This is illustrated in figure 2.22.


Example 2.22 Subspace. The dual cone of a subspace $V \subseteq \mathbf{R}^n$ (which is a cone) is
its orthogonal complement $V^\perp = \{y \mid v^T y = 0 \text{ for all } v \in V\}$.

Figure 2.22 Left. The halfspace with inward normal $y$ contains the cone $K$,
so $y \in K^*$. Right. The halfspace with inward normal $z$ does not contain $K$,
so $z \notin K^*$.


Example 2.23 Nonnegative orthant. The cone $\mathbf{R}^n_+$ is its own dual:

$$x^T y \ge 0 \text{ for all } x \succeq 0 \iff y \succeq 0.$$

We call such a cone self-dual.


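Example 2.24 Positive semidefinite cone. On the set of symmetric $n \times n$ matrices
$\mathbf{S}^n$, we use the standard inner product $\mathbf{tr}(XY) = \sum_{i,j=1}^n X_{ij} Y_{ij}$ (see §A.1.1). The
positive semidefinite cone $\mathbf{S}^n_+$ is self-dual, i.e., for $X, Y \in \mathbf{S}^n$,

$$\mathbf{tr}(XY) \ge 0 \text{ for all } X \succeq 0 \iff Y \succeq 0.$$

We will establish this fact.

Suppose $Y \notin \mathbf{S}^n_+$. Then there exists $q \in \mathbf{R}^n$ with

$$q^T Y q = \mathbf{tr}(q q^T Y) < 0.$$

Hence the positive semidefinite matrix $X = q q^T$ satisfies $\mathbf{tr}(XY) < 0$; it follows that
$Y \notin (\mathbf{S}^n_+)^*$.

Now suppose $X, Y \in \mathbf{S}^n_+$. We can express $X$ in terms of its eigenvalue decomposition
as $X = \sum_{i=1}^n \lambda_i q_i q_i^T$, where (the eigenvalues) $\lambda_i \ge 0$, $i = 1,\ldots,n$. Then we have

$$\mathbf{tr}(YX) = \mathbf{tr}\left(Y \sum_{i=1}^n \lambda_i q_i q_i^T\right) = \sum_{i=1}^n \lambda_i q_i^T Y q_i \ge 0.$$

This shows that $Y \in (\mathbf{S}^n_+)^*$.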
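
Both steps of the argument rest on the identity $q^T Y q = \mathbf{tr}(q q^T Y)$, which can be checked on $2 \times 2$ examples. The sketch below uses hypothetical helper names and hand-picked matrices; the matrix `X_psd` is built directly from an assumed eigenvalue decomposition with eigenvectors $e_1, e_2$ and eigenvalues 1 and 3.

```python
def outer(q):
    """Rank-one matrix q q^T."""
    return [[qi * qj for qj in q] for qi in q]

def trace_inner(X, Y):
    """Inner product tr(XY) = sum_ij X_ij Y_ij for symmetric X, Y."""
    n = len(X)
    return sum(X[i][j] * Y[i][j] for i in range(n) for j in range(n))

def quad_form(Y, q):
    """Quadratic form q^T Y q."""
    n = len(q)
    return sum(q[i] * Y[i][j] * q[j] for i in range(n) for j in range(n))

# First direction: Y is not positive semidefinite, and X = q q^T exposes it.
Y_indef = [[1, 0], [0, -1]]
q = [0, 1]
X_rank_one = outer(q)

# Second direction: X = 1 * e1 e1^T + 3 * e2 e2^T and a positive semidefinite Y.
Y_psd = [[2, 1], [1, 2]]
eigs = [(1, [1, 0]), (3, [0, 1])]
X_psd = [[1, 0], [0, 3]]
weighted_sum = sum(lam * quad_form(Y_psd, qi) for lam, qi in eigs)
```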


Example 2.25 Dual of a norm cone. Let $\|\cdot\|$ be a norm on $\mathbf{R}^n$. The dual of the
associated cone $K = \{(x,t) \in \mathbf{R}^{n+1} \mid \|x\| \le t\}$ is the cone defined by the dual norm,
i.e.,

$$K^* = \{(u,v) \in \mathbf{R}^{n+1} \mid \|u\|_* \le v\},$$

where the dual norm is given by $\|u\|_* = \sup\{u^T x \mid \|x\| \le 1\}$ (see (A.1.6)).

To prove the result we have to show that

$$x^T u + t v \ge 0 \text{ whenever } \|x\| \le t \iff \|u\|_* \le v. \tag{2.20}$$

Let us start by showing that the righthand condition on $(u, v)$ implies the lefthand
condition. Suppose $\|u\|_* \le v$, and $\|x\| \le t$ for some $t > 0$. (If $t = 0$, $x$ must be zero,
so obviously $u^T x + v t \ge 0$.) Applying the definition of the dual norm, and the fact
that $\|-x/t\| \le 1$, we have

$$u^T (-x/t) \le \|u\|_* \le v,$$

and therefore $u^T x + v t \ge 0$.

Next we show that the lefthand condition in (2.20) implies the righthand condition
in (2.20). Suppose $\|u\|_* > v$, i.e., that the righthand condition does not hold. Then
by the definition of the dual norm, there exists an $x$ with $\|x\| \le 1$ and $x^T u > v$.
Taking $t = 1$, we have

$$u^T (-x) + v < 0,$$

which contradicts the lefthand condition in (2.20).
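
As a concrete instance, take $\|\cdot\|$ to be the $\ell_1$-norm, whose dual norm is the $\ell_\infty$-norm (a standard fact assumed here, not derived in the example). The sketch below mirrors the two halves of the proof: for $\|u\|_\infty \le v$ it spot-checks the inequality on a few sample points of $K$, and for $\|u\|_\infty > v$ it constructs a violating point as in the second half. The function names and sample data are hypothetical.

```python
def norm1(x):
    return sum(abs(xi) for xi in x)

def norm_inf(u):
    return max(abs(ui) for ui in u)

def pairing(x, t, u, v):
    """Inner product of (x, t) and (u, v): x^T u + t v."""
    return sum(xi * ui for xi, ui in zip(x, u)) + t * v

def violating_point(u, v):
    """Assuming ||u||_inf > v, return (x, t) in K with x^T u + t v < 0."""
    k = max(range(len(u)), key=lambda i: abs(u[i]))
    x = [0] * len(u)
    x[k] = -1 if u[k] > 0 else 1      # -x attains u^T(-x) = ||u||_inf
    return x, 1

# (u, v) outside the dual cone: ||u||_inf = 3 > v = 2.
u_out, v_out = (3, -1), 2
x_bad, t_bad = violating_point(u_out, v_out)

# (u, v) inside the dual cone: ||u||_inf = 2 <= v = 2.
u_in, v_in = (1, -2), 2
samples = [((1, 0), 1), ((0, 1), 1), ((-1, 0), 1), ((0, -3), 3), ((0.5, 0.5), 1)]
```

The sampled check is only a sanity test of the inequality in (2.20), not a proof of membership in $K^*$.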


Dual cones satisfy several properties, such as:

•  $K^*$ is closed and convex.

•  $K_1 \subseteq K_2$ implies $K_2^* \subseteq K_1^*$.

•  If $K$ has nonempty interior, then $K^*$ is pointed.

•  If the closure of $K$ is pointed then $K^*$ has nonempty interior.

•  $K^{**}$ is the closure of the convex hull of $K$. (Hence if $K$ is convex and closed,
$K^{**} = K$.)

(See exercise 2.31.) These properties show that if $K$ is a proper cone, then so is its
dual $K^*$, and moreover, that $K^{**} = K$.
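
For a cone generated by finitely many vectors, membership in the dual cone (2.19) reduces to finitely many inequalities, since every $x \in K$ is a nonnegative combination of the generators. The sketch below relies on that assumption (finitely generated $K$) and uses hypothetical names; it illustrates the self-duality of $\mathbf{R}^2_+$ from example 2.23 and the reversal of inclusions under duality.

```python
def in_dual_cone(generators, y):
    """y is in K* iff g^T y >= 0 for every generator g of K.
    Testing the generators suffices because each x in K is a
    nonnegative combination of them."""
    return all(sum(gi * yi for gi, yi in zip(g, y)) >= 0 for g in generators)

orthant = [(1, 0), (0, 1)]    # generators of the nonnegative orthant in R^2
narrow = [(1, 0), (1, 1)]     # generators of a cone contained in the orthant

y_in_orthant = (2, 3)
y_outside = (1, -1)           # not in the orthant, hence not in its dual
```

The point $(1,-1)$ lies in the dual of the smaller cone but not in the dual of the orthant, so the smaller cone has the larger dual.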

2.6.2  Dual  generalized  inequalities 

Now suppose that the convex cone $K$ is proper, so it induces a generalized inequality
$\preceq_K$. Then its dual cone $K^*$ is also proper, and therefore induces a generalized
inequality. We refer to the generalized inequality $\preceq_{K^*}$ as the dual of the generalized
inequality $\preceq_K$.

Some important properties relating a generalized inequality and its dual are:

•  $x \preceq_K y$ if and only if $\lambda^T x \le \lambda^T y$ for all $\lambda \succeq_{K^*} 0$.

•  $x \prec_K y$ if and only if $\lambda^T x < \lambda^T y$ for all $\lambda \succeq_{K^*} 0$, $\lambda \ne 0$.

Since  K  =  K**,  the  dual  generalized  inequality  associated  with  is  ^ k ,  so 
these  properties  hold  if  the  generalized  inequality  and  its  dual  are  swapped.  As  a 
specific  example,  we  have  A  ^  if  and  only  if  XT x  <  yTx  for  all  x  0. 
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Example  2.26  Theorem  of  alternatives  for  linear  strict  generalized  inequalities.  Sup¬ 
pose  A  C  Rm  is  a  proper  cone.  Consider  the  strict  generalized  inequality 

Ax  -<K  b,  (2.21) 


where  x  £  Rn. 

We  will  derive  a  theorem  of  alternatives  for  this  inequality.  Suppose  it  is  infeasible, 
i.e.,  the  affine  set  { b  —  Ax  \  x  £  R’1}  does  not  intersect  the  open  convex  set  int  K. 
Then  there  is  a  separating  hyperplane,  i.e.,  a  nonzero  A  £  Rm  and  p  £  R  such  that 
A T(b  —  Ax)  <  /x  for  all  x,  and  \T y  >  y  for  all  y  £  int  A'.  The  first  condition  implies 
AT \  =  0  and  A Tb  <  y.  The  second  condition  implies  A Ty  >  y  for  all  y  £  K,  which 
can  only  happen  if  A  £  K*  and  y  <  0. 

Putting it all together we find that if (2.21) is infeasible, then there exists $\lambda$ such that

$$\lambda \ne 0, \quad \lambda \succeq_{K^*} 0, \quad A^T \lambda = 0, \quad \lambda^T b \le 0. \qquad (2.22)$$

Now we show the converse: if (2.22) holds, then the inequality system (2.21) cannot
be feasible. Suppose that both inequality systems hold. Then we have $\lambda^T(b - Ax) > 0$,
since $\lambda \ne 0$, $\lambda \succeq_{K^*} 0$, and $b - Ax \succ_K 0$. But using $A^T \lambda = 0$ we find that
$\lambda^T(b - Ax) = \lambda^T b \le 0$, which is a contradiction.

Thus, the inequality systems (2.21) and (2.22) are alternatives: for any data $A$, $b$,
exactly one of them is feasible. (This generalizes the alternatives (2.17), (2.18) for
the special case $K = \mathbf{R}^m_+$.)
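To make the two systems concrete, here is a small sketch for the special case $K = \mathbf{R}^m_+$, where (2.21) reads $Ax < b$ componentwise and $K^* = \mathbf{R}^m_+$. It only verifies a given $x$ or a given certificate $\lambda$; it does not search for one. The helper names and the tolerance `tol` are assumptions made for illustration.

```python
def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def transpose_matvec(A, lam):
    """Compute A^T lam."""
    n = len(A[0])
    return [sum(A[i][j] * lam[i] for i in range(len(A))) for j in range(n)]


def satisfies_primal(A, b, x):
    """System (2.21) with K = R^m_+: A x < b componentwise."""
    return all(ax < bi for ax, bi in zip(matvec(A, x), b))


def satisfies_alternative(A, b, lam, tol=1e-12):
    """System (2.22) with K* = R^m_+: lam != 0, lam >= 0, A^T lam = 0, lam^T b <= 0."""
    nonzero = any(abs(l) > tol for l in lam)
    in_dual_cone = all(l >= 0 for l in lam)
    orthogonal = all(abs(v) <= tol for v in transpose_matvec(A, lam))
    lam_b = sum(l * bi for l, bi in zip(lam, b))
    return nonzero and in_dual_cone and orthogonal and lam_b <= 0


A = [[1.0], [-1.0]]

# x < 0 and -x < 0 has no solution; lam = (1, 1) certifies this.
certified = satisfies_alternative(A, [0.0, 0.0], [1.0, 1.0])

# x < 1 and -x < 1 is solved by x = 0; the same lam is no longer a certificate.
solved = satisfies_primal(A, [1.0, 1.0], [0.0])
still_certificate = satisfies_alternative(A, [1.0, 1.0], [1.0, 1.0])
```

In the second instance $\lambda^T b = 2 > 0$, consistent with the fact that both systems cannot hold at once.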


2.6.3  Minimum  and  minimal  elements  via  dual  inequalities 

We can use dual generalized inequalities to characterize minimum and minimal
elements of a (possibly nonconvex) set $S \subseteq \mathbf{R}^m$ with respect to the generalized
inequality induced by a proper cone $K$.

Dual  characterization  of  minimum  element 

We first consider a characterization of the minimum element: $x$ is the minimum
element of $S$, with respect to the generalized inequality $\preceq_K$, if and only if for all
$\lambda \succ_{K^*} 0$, $x$ is the unique minimizer of $\lambda^T z$ over $z \in S$. Geometrically, this means
that for any $\lambda \succ_{K^*} 0$, the hyperplane

$$\{z \mid \lambda^T (z - x) = 0\}$$

is a strict supporting hyperplane to $S$ at $x$. (By strict supporting hyperplane, we
mean that the hyperplane intersects $S$ only at the point $x$.) Note that convexity
of the set $S$ is not required. This is illustrated in figure 2.23.

To show this result, suppose $x$ is the minimum element of $S$, i.e., $x \preceq_K z$ for
all $z \in S$, and let $\lambda \succ_{K^*} 0$. Let $z \in S$, $z \ne x$. Since $x$ is the minimum element of
$S$, we have $z - x \succeq_K 0$. From $\lambda \succ_{K^*} 0$ and $z - x \succeq_K 0$, $z - x \ne 0$, we conclude
$\lambda^T(z - x) > 0$. Since $z$ is an arbitrary element of $S$, not equal to $x$, this shows
that $x$ is the unique minimizer of $\lambda^T z$ over $z \in S$. Conversely, suppose that for all
$\lambda \succ_{K^*} 0$, $x$ is the unique minimizer of $\lambda^T z$ over $z \in S$, but $x$ is not the minimum
element of $S$. Then there exists $z \in S$ with $z \not\succeq_K x$. Since $z - x \not\succeq_K 0$, there exists
$\tilde\lambda \succeq_{K^*} 0$ with $\tilde\lambda^T(z - x) < 0$. Hence $\lambda^T(z - x) < 0$ for $\lambda \succ_{K^*} 0$ in the neighborhood
of $\tilde\lambda$. This contradicts the assumption that $x$ is the unique minimizer of $\lambda^T z$ over
$S$.

Figure 2.23 Dual characterization of minimum element. The point $x$ is the
minimum element of the set $S$ with respect to $\mathbf{R}^2_+$. This is equivalent to:
for every $\lambda \succ 0$, the hyperplane $\{z \mid \lambda^T(z - x) = 0\}$ strictly supports $S$ at
$x$, i.e., contains $S$ on one side, and touches it only at $x$.
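The characterization can be tried out on a small finite (hence nonconvex) set. The sketch below assumes $K = \mathbf{R}^2_+$ and a hand-picked set `S`; the names are illustrative. It tests a few strictly positive $\lambda$, which is only a spot check, since the statement quantifies over all $\lambda \succ_{K^*} 0$.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def is_minimum(x, S):
    """x is the minimum element of S w.r.t. R^n_+: x <= z componentwise for all z in S."""
    return all(all(xi <= zi for xi, zi in zip(x, z)) for z in S)


def unique_minimizer(lam, S):
    """Return the unique minimizer of lam^T z over S, or None if the minimum is tied."""
    best = min(dot(lam, z) for z in S)
    argmins = [z for z in S if dot(lam, z) == best]
    return argmins[0] if len(argmins) == 1 else None


S = [(1, 1), (1, 3), (2, 2), (4, 1), (3, 5)]
x = (1, 1)

# Strictly positive weights: x should be the unique minimizer each time.
lams = [(1, 1), (5, 1), (1, 5), (0.1, 3)]
checks = [unique_minimizer(lam, S) == x for lam in lams]

# A weight on the boundary of the dual cone can produce a tie.
tie = unique_minimizer((0, 1), S)
```

With $\lambda = (0, 1)$ the points $(1,1)$ and $(4,1)$ tie, which shows why the statement requires $\lambda \succ_{K^*} 0$ rather than $\lambda \succeq_{K^*} 0$.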

Dual  characterization  of  minimal  elements 

We now turn to a similar characterization of minimal elements. Here there is a gap
between the necessary and sufficient conditions. If $\lambda \succ_{K^*} 0$ and $x$ minimizes $\lambda^T z$
over $z \in S$, then $x$ is minimal. This is illustrated in figure 2.24.

To show this, suppose that $\lambda \succ_{K^*} 0$, and $x$ minimizes $\lambda^T z$ over $S$, but $x$ is not
minimal, i.e., there exists a $z \in S$, $z \ne x$, and $z \preceq_K x$. Then $\lambda^T(x - z) > 0$, which
contradicts our assumption that $x$ is the minimizer of $\lambda^T z$ over $S$.
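The sufficient condition can be illustrated on a finite set. This is a minimal sketch assuming $K = \mathbf{R}^2_+$ (so $\lambda \succ_{K^*} 0$ means both weights are positive); the set `S` and the function names are chosen for illustration only.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def dominates(z, x):
    """True if z <= x componentwise and z != x (ordering induced by R^n_+)."""
    return z != x and all(zi <= xi for zi, xi in zip(z, x))


def minimal_elements(S):
    """Points of S that no other point of S dominates."""
    return [x for x in S if not any(dominates(z, x) for z in S)]


def minimizers(lam, S):
    """All minimizers of lam^T z over S."""
    best = min(dot(lam, z) for z in S)
    return [z for z in S if dot(lam, z) == best]


S = [(1, 4), (2, 2), (4, 1), (3, 3), (4, 4)]
minimal = minimal_elements(S)

# Each strictly positive lam picks out a minimal point of S.
found = {lam: minimizers(lam, S) for lam in [(1, 1), (3, 1), (1, 3)]}
```

Different positive weights select different minimal points, which is the situation depicted for $\lambda_1$ and $\lambda_2$ in figure 2.24.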

The converse is in general false: a point $x$ can be minimal in $S$, but not a
minimizer of $\lambda^T z$ over $z \in S$, for any $\lambda$, as shown in figure 2.25. This figure
suggests that convexity plays an important role in the converse, which is correct.
Provided the set $S$ is convex, we can say that for any minimal element $x$ there
exists a nonzero $\lambda \succeq_{K^*} 0$ such that $x$ minimizes $\lambda^T z$ over $z \in S$.

To show this, suppose $x$ is minimal, which means that $((x - K) \setminus \{x\}) \cap S = \emptyset$.
Applying the separating hyperplane theorem to the convex sets $(x - K) \setminus \{x\}$ and
$S$, we conclude that there is a $\lambda \ne 0$ and $\mu$ such that $\lambda^T(x - y) \le \mu$ for all $y \in K$,
and $\lambda^T z \ge \mu$ for all $z \in S$. From the first inequality we conclude $\lambda \succeq_{K^*} 0$. Since
$x \in S$ and $x \in x - K$, we have $\lambda^T x = \mu$, so the second inequality implies that $\mu$
is the minimum value of $\lambda^T z$ over $S$. Therefore, $x$ is a minimizer of $\lambda^T z$ over $S$,
where $\lambda \ne 0$, $\lambda \succeq_{K^*} 0$.

This converse theorem cannot be strengthened to $\lambda \succ_{K^*} 0$. Examples show
that a point $x$ can be a minimal point of a convex set $S$, but not a minimizer of
$\lambda^T z$ over $z \in S$ for any $\lambda \succ_{K^*} 0$. (See figure 2.26, left.) Nor is it true that any
minimizer of $\lambda^T z$ over $z \in S$, with $\lambda \succeq_{K^*} 0$, is minimal (see figure 2.26, right).

Figure 2.24 A set $S \subseteq \mathbf{R}^2$. Its set of minimal points, with respect to $\mathbf{R}^2_+$, is
shown as the darker section of its (lower, left) boundary. The minimizer of
$\lambda_1^T z$ over $S$ is $x_1$, and is minimal since $\lambda_1 \succ 0$. The minimizer of $\lambda_2^T z$ over
$S$ is $x_2$, which is another minimal point of $S$, since $\lambda_2 \succ 0$.

Figure 2.25 The point $x$ is a minimal element of $S \subseteq \mathbf{R}^2$ with respect to
$\mathbf{R}^2_+$. However there exists no $\lambda$ for which $x$ minimizes $\lambda^T z$ over $z \in S$.

Figure 2.26 Left. The point $x_1 \in S_1$ is minimal, but is not a minimizer of
$\lambda^T z$ over $S_1$ for any $\lambda \succ 0$. (It does, however, minimize $\lambda^T z$ over $z \in S_1$ for
$\lambda = (1, 0)$.) Right. The point $x_2 \in S_2$ is not minimal, but it does minimize
$\lambda^T z$ over $z \in S_2$ for $\lambda = (0, 1) \succeq 0$.
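The second failure mode (a minimizer for some $\lambda \succeq_{K^*} 0$ that is not minimal) already occurs for a three-point set. The sketch below assumes $K = \mathbf{R}^2_+$ and uses a finite set rather than the convex set $S_2$ of figure 2.26; the set and names are illustrative.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def dominates(z, x):
    """True if z <= x componentwise and z != x."""
    return z != x and all(zi <= xi for zi, xi in zip(z, x))


S = [(0, 0), (2, 0), (1, 1)]
lam = (0, 1)  # lam >= 0 but not strictly positive

# All minimizers of lam^T z over S.
best = min(dot(lam, z) for z in S)
minimizers = [z for z in S if dot(lam, z) == best]

# Minimal elements of S with respect to R^2_+.
minimal = [x for x in S if not any(dominates(z, x) for z in S)]

# Minimizers that are not minimal.
non_minimal_minimizers = [z for z in minimizers if z not in minimal]
```

Here $(2, 0)$ minimizes $\lambda^T z$ but is dominated by $(0, 0)$, because the zero weight on the first coordinate ignores the difference between them.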


Example 2.27 Pareto optimal production frontier. We consider a product which
requires $n$ resources (such as labor, electricity, natural gas, water) to manufacture.
The product can be manufactured or produced in many ways. With each production
method, we associate a resource vector $x \in \mathbf{R}^n$, where $x_i$ denotes the amount of
resource $i$ consumed by the method to manufacture the product. We assume that $x_i \ge 0$
(i.e., resources are consumed by the production methods) and that the resources
are valuable (so using less of any resource is preferred).

The production set $P \subseteq \mathbf{R}^n$ is defined as the set of all resource vectors $x$ that
correspond to some production method.

Production  methods  with  resource  vectors  that  are  minimal  elements  of  P,  with 
respect  to  componentwise  inequality,  are  called  Pareto  optimal  or  efficient.  The  set 
of  minimal  elements  of  P  is  called  the  efficient  production  frontier. 

We can give a simple interpretation of Pareto optimality. We say that one production
method, with resource vector $x$, is better than another, with resource vector $y$, if
$x_i \le y_i$ for all $i$, and for some $i$, $x_i < y_i$. In other words, one production method
is better than another if it uses no more of each resource than another method, and
for at least one resource, actually uses less. This corresponds to $x \preceq y$, $x \ne y$. Then
we can say: A production method is Pareto optimal or efficient if there is no better
production method.

We can find Pareto optimal production methods (i.e., minimal resource vectors) by
minimizing

$$\lambda^T x = \lambda_1 x_1 + \cdots + \lambda_n x_n$$

over the set $P$ of production vectors, using any $\lambda$ that satisfies $\lambda \succ 0$.

Here the vector $\lambda$ has a simple interpretation: $\lambda_i$ is the price of resource $i$. By
minimizing $\lambda^T x$ over $P$ we are finding the overall cheapest production method (for
the resource prices $\lambda_i$). As long as the prices are positive, the resulting production
method is guaranteed to be efficient.

These ideas are illustrated in figure 2.27.

Figure 2.27 The production set $P$, for a product that requires labor and
fuel to produce, is shown shaded. The two dark curves show the efficient
production frontier. The points $x_1$, $x_2$ and $x_3$ are efficient. The points $x_4$
and $x_5$ are not (since in particular, $x_2$ corresponds to a production method
that uses no more fuel, and less labor). The point $x_1$ is also the minimum
cost production method for the price vector $\lambda$ (which is positive). The point
$x_2$ is efficient, but cannot be found by minimizing the total cost $\lambda^T x$ for any
price vector $\lambda \succeq 0$.
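The price interpretation can be sketched with a handful of production methods. The data below (labor, fuel) are made up for illustration and are not taken from figure 2.27; the sweep over a grid of positive price vectors is likewise an assumption of this sketch, not a procedure from the text.

```python
def cost(prices, x):
    """Total cost lambda^T x of a production method with resource vector x."""
    return sum(p * xi for p, xi in zip(prices, x))


def is_efficient(x, P):
    """No other method uses no more of every resource than x does."""
    return not any(y != x and all(yi <= xi for yi, xi in zip(y, x)) for y in P)


# Hypothetical resource vectors (labor, fuel) for four production methods.
P = [(1.0, 4.0), (4.0, 1.0), (3.0, 3.0), (5.0, 5.0)]
efficient = [x for x in P if is_efficient(x, P)]

# Cheapest method for each positive price vector (k/20, 1 - k/20), k = 1..19.
cheapest = set()
for k in range(1, 20):
    prices = (k / 20, 1 - k / 20)
    cheapest.add(min(P, key=lambda x: cost(prices, x)))
```

Every method found by the sweep is efficient, as guaranteed for positive prices. The method $(3, 3)$ is efficient too, yet it is never the cheapest: being cheapest would require both $2\lambda_1 \le \lambda_2$ and $2\lambda_2 \le \lambda_1$, which forces $\lambda = 0$. This mirrors the role of $x_2$ in figure 2.27 when $P$ is not convex.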


Bibliography

Exercises

Definition  of  convexity 

2.1 Let $C \subseteq \mathbf{R}^n$ be a convex set, with $x_1, \ldots, x_k \in C$, and let $\theta_1, \ldots, \theta_k \in \mathbf{R}$ satisfy $\theta_i \ge 0$,
$\theta_1 + \cdots + \theta_k = 1$. Show that $\theta_1 x_1 + \cdots + \theta_k x_k \in C$. (The definition of convexity is that
this holds for $k = 2$; you must show it for arbitrary $k$.) Hint. Use induction on $k$.

2.2  Show  that  a  set  is  convex  if  and  only  if  its  intersection  with  any  line  is  convex.  Show  that 
a  set  is  affine  if  and  only  if  its  intersection  with  any  line  is  affine. 

2.3 Midpoint convexity. A set $C$ is midpoint convex if whenever two points $a$, $b$ are in $C$, the
average or midpoint $(a + b)/2$ is in $C$. Obviously a convex set is midpoint convex. It can
be proved that under mild conditions midpoint convexity implies convexity. As a simple
case, prove that if $C$ is closed and midpoint convex, then $C$ is convex.

2.4  Show  that  the  convex  hull  of  a  set  S  is  the  intersection  of  all  convex  sets  that  contain  S. 
(The  same  method  can  be  used  to  show  that  the  conic,  or  affine,  or  linear  hull  of  a  set  S 
is  the  intersection  of  all  conic  sets,  or  affine  sets,  or  subspaces  that  contain  S.) 

Examples 

2.5 What is the distance between two parallel hyperplanes $\{x \in \mathbf{R}^n \mid a^T x = b_1\}$ and
$\{x \in \mathbf{R}^n \mid a^T x = b_2\}$?

2.6 When does one halfspace contain another? Give conditions under which

$$\{x \mid a^T x \le b\} \subseteq \{x \mid \tilde a^T x \le \tilde b\}$$

(where $a \ne 0$, $\tilde a \ne 0$). Also find the conditions under which the two halfspaces are equal.

2.7 Voronoi description of halfspace. Let $a$ and $b$ be distinct points in $\mathbf{R}^n$. Show that the set
of all points that are closer (in Euclidean norm) to $a$ than $b$, i.e., $\{x \mid \|x - a\|_2 \le \|x - b\|_2\}$,
is a halfspace. Describe it explicitly as an inequality of the form $c^T x \le d$. Draw a picture.

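2.8 Which of the following sets $S$ are polyhedra? If possible, express $S$ in the form
$S = \{x \mid Ax \preceq b, \; Fx = g\}$.

(a) $S = \{y_1 a_1 + y_2 a_2 \mid -1 \le y_1 \le 1, \; -1 \le y_2 \le 1\}$, where $a_1, a_2 \in \mathbf{R}^n$.

(b) $S = \{x \in \mathbf{R}^n \mid x \succeq 0, \; \mathbf{1}^T x = 1, \; \sum_{i=1}^n x_i a_i = b_1, \; \sum_{i=1}^n x_i a_i^2 = b_2\}$, where
$a_1, \ldots, a_n \in \mathbf{R}$ and $b_1, b_2 \in \mathbf{R}$.

(c) $S = \{x \in \mathbf{R}^n \mid x \succeq 0, \; x^T y \le 1 \text{ for all } y \text{ with } \|y\|_2 = 1\}$.

(d) $S = \{x \in \mathbf{R}^n \mid x \succeq 0, \; x^T y \le 1 \text{ for all } y \text{ with } \sum_{i=1}^n |y_i| = 1\}$.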

2.9 Voronoi sets and polyhedral decomposition. Let $x_0, \ldots, x_K \in \mathbf{R}^n$. Consider the set of
points that are closer (in Euclidean norm) to $x_0$ than the other $x_i$, i.e.,

$$V = \{x \in \mathbf{R}^n \mid \|x - x_0\|_2 \le \|x - x_i\|_2, \; i = 1, \ldots, K\}.$$

$V$ is called the Voronoi region around $x_0$ with respect to $x_1, \ldots, x_K$.

(a) Show that $V$ is a polyhedron. Express $V$ in the form $V = \{x \mid Ax \preceq b\}$.

(b) Conversely, given a polyhedron $P$ with nonempty interior, show how to find $x_0, \ldots, x_K$
so that the polyhedron is the Voronoi region of $x_0$ with respect to $x_1, \ldots, x_K$.

(c) We can also consider the sets

$$V_k = \{x \in \mathbf{R}^n \mid \|x - x_k\|_2 \le \|x - x_i\|_2, \; i \ne k\}.$$

The set $V_k$ consists of points in $\mathbf{R}^n$ for which the closest point in the set $\{x_0, \ldots, x_K\}$
is $x_k$.

The sets $V_0, \ldots, V_K$ give a polyhedral decomposition of $\mathbf{R}^n$. More precisely, the sets
$V_k$ are polyhedra, $\bigcup_{k=0}^K V_k = \mathbf{R}^n$, and $\operatorname{int} V_i \cap \operatorname{int} V_j = \emptyset$ for $i \ne j$, i.e., $V_i$ and $V_j$
intersect at most along a boundary.

Suppose that $P_1, \ldots, P_m$ are polyhedra such that $\bigcup_{i=1}^m P_i = \mathbf{R}^n$, and
$\operatorname{int} P_i \cap \operatorname{int} P_j = \emptyset$ for $i \ne j$. Can this polyhedral decomposition of $\mathbf{R}^n$ be described as
the Voronoi regions generated by an appropriate set of points?

2.10 Solution set of a quadratic inequality. Let $C \subseteq \mathbf{R}^n$ be the solution set of a quadratic
inequality,

$$C = \{x \in \mathbf{R}^n \mid x^T A x + b^T x + c \le 0\},$$

with $A \in \mathbf{S}^n$, $b \in \mathbf{R}^n$, and $c \in \mathbf{R}$.

(a) Show that $C$ is convex if $A \succeq 0$.

(b) Show that the intersection of $C$ and the hyperplane defined by $g^T x + h = 0$ (where
$g \ne 0$) is convex if $A + \lambda g g^T \succeq 0$ for some $\lambda \in \mathbf{R}$.

Are the converses of these statements true?

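2.11 Hyperbolic sets. Show that the hyperbolic set $\{x \in \mathbf{R}^2_+ \mid x_1 x_2 \ge 1\}$ is convex. As a
generalization, show that $\{x \in \mathbf{R}^n_+ \mid \prod_{i=1}^n x_i \ge 1\}$ is convex. Hint. If $a, b \ge 0$ and
$0 \le \theta \le 1$, then $a^\theta b^{1-\theta} \le \theta a + (1 - \theta) b$; see §3.1.9.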

2.12 Which of the following sets are convex?

(a) A slab, i.e., a set of the form $\{x \in \mathbf{R}^n \mid \alpha \le a^T x \le \beta\}$.

(b) A rectangle, i.e., a set of the form $\{x \in \mathbf{R}^n \mid \alpha_i \le x_i \le \beta_i, \; i = 1, \ldots, n\}$. A rectangle
is sometimes called a hyperrectangle when $n > 2$.

(c) A wedge, i.e., $\{x \in \mathbf{R}^n \mid a_1^T x \le b_1, \; a_2^T x \le b_2\}$.

(d) The set of points closer to a given point than a given set, i.e.,

$$\{x \mid \|x - x_0\|_2 \le \|x - y\|_2 \text{ for all } y \in S\}$$

where $S \subseteq \mathbf{R}^n$.

(e) The set of points closer to one set than another, i.e.,

$$\{x \mid \operatorname{dist}(x, S) \le \operatorname{dist}(x, T)\},$$

where $S, T \subseteq \mathbf{R}^n$, and

$$\operatorname{dist}(x, S) = \inf\{\|x - z\|_2 \mid z \in S\}.$$

(f) [HUL93, volume 1, page 93] The set $\{x \mid x + S_2 \subseteq S_1\}$, where $S_1, S_2 \subseteq \mathbf{R}^n$ with $S_1$
convex.

(g) The set of points whose distance to $a$ does not exceed a fixed fraction $\theta$ of the
distance to $b$, i.e., the set $\{x \mid \|x - a\|_2 \le \theta \|x - b\|_2\}$. You can assume $a \ne b$ and
$0 \le \theta \le 1$.

2.13 Conic hull of outer products. Consider the set of rank-$k$ outer products, defined as
$\{X X^T \mid X \in \mathbf{R}^{n \times k}, \; \operatorname{rank} X = k\}$. Describe its conic hull in simple terms.

2.14 Expanded and restricted sets. Let $S \subseteq \mathbf{R}^n$, and let $\|\cdot\|$ be a norm on $\mathbf{R}^n$.

(a) For $a \ge 0$ we define $S_a$ as $\{x \mid \operatorname{dist}(x, S) \le a\}$, where $\operatorname{dist}(x, S) = \inf_{y \in S} \|x - y\|$.
We refer to $S_a$ as $S$ expanded or extended by $a$. Show that if $S$ is convex, then $S_a$
is convex.

(b) For $a \ge 0$ we define $S_{-a} = \{x \mid B(x, a) \subseteq S\}$, where $B(x, a)$ is the ball (in the norm
$\|\cdot\|$), centered at $x$, with radius $a$. We refer to $S_{-a}$ as $S$ shrunk or restricted by $a$,
since $S_{-a}$ consists of all points that are at least a distance $a$ from $\mathbf{R}^n \setminus S$. Show that
if $S$ is convex, then $S_{-a}$ is convex.

2.15 Some sets of probability distributions. Let $x$ be a real-valued random variable with
$\operatorname{prob}(x = a_i) = p_i$, $i = 1, \ldots, n$, where $a_1 < a_2 < \cdots < a_n$. Of course $p \in \mathbf{R}^n$ lies
in the standard probability simplex $P = \{p \mid \mathbf{1}^T p = 1, \; p \succeq 0\}$. Which of the following
conditions are convex in $p$? (That is, for which of the following conditions is the set of
$p \in P$ that satisfy the condition convex?)

(a) $\alpha \le \mathbf{E} f(x) \le \beta$, where $\mathbf{E} f(x)$ is the expected value of $f(x)$, i.e., $\mathbf{E} f(x) =
\sum_{i=1}^n p_i f(a_i)$. (The function $f : \mathbf{R} \to \mathbf{R}$ is given.)

(b) $\operatorname{prob}(x > \alpha) \le \beta$.

(c) $\mathbf{E} |x^3| \le \alpha \, \mathbf{E} |x|$.

(d) $\mathbf{E} x^2 \le \alpha$.

(e) $\mathbf{E} x^2 \ge \alpha$.

(f) $\operatorname{var}(x) \le \alpha$, where $\operatorname{var}(x) = \mathbf{E}(x - \mathbf{E} x)^2$ is the variance of $x$.

(g) $\operatorname{var}(x) \ge \alpha$.

(h) $\operatorname{quartile}(x) \ge \alpha$, where $\operatorname{quartile}(x) = \inf\{\beta \mid \operatorname{prob}(x \le \beta) \ge 0.25\}$.

(i) $\operatorname{quartile}(x) \le \alpha$.


Operations  that  preserve  convexity 

2.16 Show that if $S_1$ and $S_2$ are convex sets in $\mathbf{R}^{m+n}$, then so is their partial sum

$$S = \{(x, y_1 + y_2) \mid x \in \mathbf{R}^m, \; y_1, y_2 \in \mathbf{R}^n, \; (x, y_1) \in S_1, \; (x, y_2) \in S_2\}.$$

2.17 Image of polyhedral sets under perspective function. In this problem we study the image
of hyperplanes, halfspaces, and polyhedra under the perspective function $P(x, t) = x/t$,
with $\operatorname{dom} P = \mathbf{R}^n \times \mathbf{R}_{++}$. For each of the following sets $C$, give a simple description of

$$P(C) = \{v/t \mid (v, t) \in C, \; t > 0\}.$$

(a) The polyhedron $C = \operatorname{conv}\{(v_1, t_1), \ldots, (v_K, t_K)\}$ where $v_i \in \mathbf{R}^n$ and $t_i > 0$.

(b) The hyperplane $C = \{(v, t) \mid f^T v + g t = h\}$ (with $f$ and $g$ not both zero).

(c) The halfspace $C = \{(v, t) \mid f^T v + g t \le h\}$ (with $f$ and $g$ not both zero).

(d) The polyhedron $C = \{(v, t) \mid F v + g t \preceq h\}$.

2.18 Invertible linear-fractional functions. Let $f : \mathbf{R}^n \to \mathbf{R}^n$ be the linear-fractional function

$$f(x) = (Ax + b)/(c^T x + d), \qquad \operatorname{dom} f = \{x \mid c^T x + d > 0\}.$$

Suppose the matrix

$$Q = \begin{bmatrix} A & b \\ c^T & d \end{bmatrix}$$

is nonsingular. Show that $f$ is invertible and that $f^{-1}$ is a linear-fractional mapping.
Give an explicit expression for $f^{-1}$ and its domain in terms of $A$, $b$, $c$, and $d$. Hint. It
may be easier to express $f^{-1}$ in terms of $Q$.

2.19 Linear-fractional functions and convex sets. Let $f : \mathbf{R}^m \to \mathbf{R}^n$ be the linear-fractional
function

$$f(x) = (Ax + b)/(c^T x + d), \qquad \operatorname{dom} f = \{x \mid c^T x + d > 0\}.$$

In this problem we study the inverse image of a convex set $C$ under $f$, i.e.,

$$f^{-1}(C) = \{x \in \operatorname{dom} f \mid f(x) \in C\}.$$

For each of the following sets $C \subseteq \mathbf{R}^n$, give a simple description of $f^{-1}(C)$.

(a) The halfspace $C = \{y \mid g^T y \le h\}$ (with $g \ne 0$).

(b) The polyhedron $C = \{y \mid G y \preceq h\}$.

(c) The ellipsoid $\{y \mid y^T P^{-1} y \le 1\}$ (where $P \in \mathbf{S}^n_{++}$).

(d) The solution set of a linear matrix inequality, $C = \{y \mid y_1 A_1 + \cdots + y_n A_n \preceq B\}$,
where $A_1, \ldots, A_n, B \in \mathbf{S}^p$.

Separation  theorems  and  supporting  hyperplanes 

2.20 Strictly positive solution of linear equations. Suppose $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$, with $b \in \mathcal{R}(A)$.
Show that there exists an $x$ satisfying

$$x \succ 0, \qquad Ax = b$$

if and only if there exists no $\lambda$ with

$$A^T \lambda \succeq 0, \qquad A^T \lambda \ne 0, \qquad b^T \lambda \le 0.$$

Hint. First prove the following fact from linear algebra: $c^T x = d$ for all $x$ satisfying
$Ax = b$ if and only if there is a vector $\lambda$ such that $c = A^T \lambda$, $d = b^T \lambda$.

2.21 The set of separating hyperplanes. Suppose that $C$ and $D$ are disjoint subsets of $\mathbf{R}^n$.
Consider the set of $(a, b) \in \mathbf{R}^{n+1}$ for which $a^T x \le b$ for all $x \in C$, and $a^T x \ge b$ for all
$x \in D$. Show that this set is a convex cone (which is the singleton $\{0\}$ if there is no
hyperplane that separates $C$ and $D$).

2.22  Finish  the  proof  of  the  separating  hyperplane  theorem  in  §2.5.1:  Show  that  a  separating 
hyperplane  exists  for  two  disjoint  convex  sets  C  and  D.  You  can  use  the  result  proved 
in  §2.5.1,  i.e.,  that  a  separating  hyperplane  exists  when  there  exist  points  in  the  two  sets 
whose  distance  is  equal  to  the  distance  between  the  two  sets. 

Hint. If $C$ and $D$ are disjoint convex sets, then the set $\{x - y \mid x \in C, \; y \in D\}$ is convex
and does not contain the origin.

2.23 Give an example of two closed convex sets that are disjoint but cannot be strictly
separated.

2.24  Supporting  hyperplanes. 

(a) Express the closed convex set $\{x \in \mathbf{R}^2_+ \mid x_1 x_2 \ge 1\}$ as an intersection of halfspaces.

(b) Let $C = \{x \in \mathbf{R}^n \mid \|x\|_\infty \le 1\}$, the $\ell_\infty$-norm unit ball in $\mathbf{R}^n$, and let $x$ be a point
in the boundary of $C$. Identify the supporting hyperplanes of $C$ at $x$ explicitly.

2.25 Inner and outer polyhedral approximations. Let $C \subseteq \mathbf{R}^n$ be a closed convex set, and
suppose that $x_1, \ldots, x_K$ are on the boundary of $C$. Suppose that for each $i$, $a_i^T(x - x_i) = 0$
defines a supporting hyperplane for $C$ at $x_i$, i.e., $C \subseteq \{x \mid a_i^T(x - x_i) \le 0\}$. Consider the
two polyhedra

$$P_{\mathrm{inner}} = \operatorname{conv}\{x_1, \ldots, x_K\}, \qquad P_{\mathrm{outer}} = \{x \mid a_i^T(x - x_i) \le 0, \; i = 1, \ldots, K\}.$$

Show that $P_{\mathrm{inner}} \subseteq C \subseteq P_{\mathrm{outer}}$. Draw a picture illustrating this.

2.26 Support function. The support function of a set $C \subseteq \mathbf{R}^n$ is defined as

$$S_C(y) = \sup\{y^T x \mid x \in C\}.$$

(We allow $S_C(y)$ to take on the value $+\infty$.) Suppose that $C$ and $D$ are closed convex sets
in $\mathbf{R}^n$. Show that $C = D$ if and only if their support functions are equal.

2.27 Converse supporting hyperplane theorem. Suppose the set $C$ is closed, has nonempty
interior, and has a supporting hyperplane at every point in its boundary. Show that $C$ is
convex.

Convex cones and generalized inequalities

2.28 Positive semidefinite cone for $n = 1, 2, 3$. Give an explicit description of the positive
semidefinite cone $\mathbf{S}^n_+$, in terms of the matrix coefficients and ordinary inequalities, for
$n = 1, 2, 3$. To describe a general element of $\mathbf{S}^n$, for $n = 1, 2, 3$, use the notation

$$x_1, \qquad \begin{bmatrix} x_1 & x_2 \\ x_2 & x_3 \end{bmatrix}, \qquad \begin{bmatrix} x_1 & x_2 & x_3 \\ x_2 & x_4 & x_5 \\ x_3 & x_5 & x_6 \end{bmatrix}.$$

2.29 Cones in $\mathbf{R}^2$. Suppose $K \subseteq \mathbf{R}^2$ is a closed convex cone.

(a) Give a simple description of $K$ in terms of the polar coordinates of its elements
($x = r(\cos\phi, \sin\phi)$ with $r \ge 0$).

(b) Give a simple description of $K^*$, and draw a plot illustrating the relation between
$K$ and $K^*$.

(c) When is $K$ pointed?

(d) When is $K$ proper (hence, defines a generalized inequality)? Draw a plot illustrating
what $x \preceq_K y$ means when $K$ is proper.

2.30 Properties of generalized inequalities. Prove the properties of (nonstrict and strict)
generalized inequalities listed in §2.4.1.

2.31 Properties of dual cones. Let $K^*$ be the dual cone of a convex cone $K$, as defined in (2.19).
Prove the following.

(a) $K^*$ is indeed a convex cone.

(b) $K_1 \subseteq K_2$ implies $K_2^* \subseteq K_1^*$.

(c) $K^*$ is closed.

(d) The interior of $K^*$ is given by $\operatorname{int} K^* = \{y \mid y^T x > 0 \text{ for all } x \in \operatorname{cl} K\}$.

(e) If $K$ has nonempty interior then $K^*$ is pointed.

(f) $K^{**}$ is the closure of $K$. (Hence if $K$ is closed, $K^{**} = K$.)

(g) If the closure of $K$ is pointed then $K^*$ has nonempty interior.

2.32 Find the dual cone of $\{Ax \mid x \succeq 0\}$, where $A \in \mathbf{R}^{m \times n}$.

2.33 The monotone nonnegative cone. We define the monotone nonnegative cone as

$$K_{\mathrm{m+}} = \{x \in \mathbf{R}^n \mid x_1 \ge x_2 \ge \cdots \ge x_n \ge 0\},$$

i.e., all nonnegative vectors with components sorted in nonincreasing order.

(a) Show that $K_{\mathrm{m+}}$ is a proper cone.

(b) Find the dual cone $K_{\mathrm{m+}}^*$. Hint. Use the identity

$$\sum_{i=1}^n x_i y_i = (x_1 - x_2) y_1 + (x_2 - x_3)(y_1 + y_2) + (x_3 - x_4)(y_1 + y_2 + y_3) + \cdots
+ (x_{n-1} - x_n)(y_1 + \cdots + y_{n-1}) + x_n (y_1 + \cdots + y_n).$$

2.34 The lexicographic cone and ordering. The lexicographic cone is defined as

$$K_{\mathrm{lex}} = \{0\} \cup \{x \in \mathbf{R}^n \mid x_1 = \cdots = x_k = 0, \; x_{k+1} > 0, \text{ for some } k, \; 0 \le k < n\},$$

i.e., all vectors whose first nonzero coefficient (if any) is positive.

(a) Verify that $K_{\mathrm{lex}}$ is a cone, but not a proper cone.

(b) We define the lexicographic ordering on $\mathbf{R}^n$ as follows: $x \le_{\mathrm{lex}} y$ if and only if
$y - x \in K_{\mathrm{lex}}$. (Since $K_{\mathrm{lex}}$ is not a proper cone, the lexicographic ordering is not a
generalized inequality.) Show that the lexicographic ordering is a linear ordering:
for any $x, y \in \mathbf{R}^n$, either $x \le_{\mathrm{lex}} y$ or $y \le_{\mathrm{lex}} x$. Therefore any set of vectors can be
sorted with respect to the lexicographic cone, which yields the familiar sorting used
in dictionaries.

(c) Find $K_{\mathrm{lex}}^*$.

2.35 Copositive matrices. A matrix $X \in \mathbf{S}^n$ is called copositive if $z^T X z \ge 0$ for all $z \succeq 0$.
Verify that the set of copositive matrices is a proper cone. Find its dual cone.

2.36 Euclidean distance matrices. Let $x_1, \ldots, x_n \in \mathbf{R}^k$. The matrix $D \in \mathbf{S}^n$ defined by $D_{ij} =
\|x_i - x_j\|_2^2$ is called a Euclidean distance matrix. It satisfies some obvious properties such
as $D_{ij} = D_{ji}$, $D_{ii} = 0$, $D_{ij} \ge 0$, and (from the triangle inequality) $D_{ik}^{1/2} \le D_{ij}^{1/2} + D_{jk}^{1/2}$.
We now pose the question: When is a matrix $D \in \mathbf{S}^n$ a Euclidean distance matrix (for
some points in $\mathbf{R}^k$, for some $k$)? A famous result answers this question: $D \in \mathbf{S}^n$ is a
Euclidean distance matrix if and only if $D_{ii} = 0$ and $x^T D x \le 0$ for all $x$ with $\mathbf{1}^T x = 0$.
(See §8.3.3.)

Show that the set of Euclidean distance matrices is a convex cone.

2.37 Nonnegative polynomials and Hankel LMIs. Let $K_{\mathrm{pol}}$ be the set of (coefficients of)
nonnegative polynomials of degree $2k$ on $\mathbf{R}$:

$$K_{\mathrm{pol}} = \{x \in \mathbf{R}^{2k+1} \mid x_1 + x_2 t + x_3 t^2 + \cdots + x_{2k+1} t^{2k} \ge 0 \text{ for all } t \in \mathbf{R}\}.$$

(a) Show that $K_{\mathrm{pol}}$ is a proper cone.

(b) A basic result states that a polynomial of degree $2k$ is nonnegative on $\mathbf{R}$ if and only
if it can be expressed as the sum of squares of two polynomials of degree $k$ or less.
In other words, $x \in K_{\mathrm{pol}}$ if and only if the polynomial

$$p(t) = x_1 + x_2 t + x_3 t^2 + \cdots + x_{2k+1} t^{2k}$$

can be expressed as

$$p(t) = r(t)^2 + s(t)^2,$$

where $r$ and $s$ are polynomials of degree $k$.

Use this result to show that

$$K_{\mathrm{pol}} = \Big\{ x \in \mathbf{R}^{2k+1} \;\Big|\; x_i = \sum_{m+n=i+1} Y_{mn} \text{ for some } Y \in \mathbf{S}^{k+1}_+ \Big\}.$$

In other words, $p(t) = x_1 + x_2 t + x_3 t^2 + \cdots + x_{2k+1} t^{2k}$ is nonnegative if and only if
there exists a matrix $Y \in \mathbf{S}^{k+1}_+$ such that

$$\begin{aligned}
x_1 &= Y_{11} \\
x_2 &= Y_{12} + Y_{21} \\
x_3 &= Y_{13} + Y_{22} + Y_{31} \\
&\;\;\vdots \\
x_{2k+1} &= Y_{k+1,k+1}.
\end{aligned}$$

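(c) Show that $K_{\mathrm{pol}}^* = K_{\mathrm{han}}$ where

$$K_{\mathrm{han}} = \{z \in \mathbf{R}^{2k+1} \mid H(z) \succeq 0\}$$

and

$$H(z) = \begin{bmatrix}
z_1 & z_2 & z_3 & \cdots & z_k & z_{k+1} \\
z_2 & z_3 & z_4 & \cdots & z_{k+1} & z_{k+2} \\
z_3 & z_4 & z_5 & \cdots & z_{k+2} & z_{k+3} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
z_k & z_{k+1} & z_{k+2} & \cdots & z_{2k-1} & z_{2k} \\
z_{k+1} & z_{k+2} & z_{k+3} & \cdots & z_{2k} & z_{2k+1}
\end{bmatrix}.$$

(This is the Hankel matrix with coefficients $z_1, \ldots, z_{2k+1}$.)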

(d) Let $K_{\mathrm{mom}}$ be the conic hull of the set of all vectors of the form $(1, t, t^2, \ldots, t^{2k})$,
where $t \in \mathbf{R}$. Show that $y \in K_{\mathrm{mom}}$ if and only if $y_1 \ge 0$ and

$$y = y_1 (1, \mathbf{E} u, \mathbf{E} u^2, \ldots, \mathbf{E} u^{2k})$$

for some random variable $u$. In other words, the elements of $K_{\mathrm{mom}}$ are nonnegative
multiples of the moment vectors of all possible distributions on $\mathbf{R}$. Show that
$K_{\mathrm{pol}} = K_{\mathrm{mom}}^*$.

(e) Combining the results of (c) and (d), conclude that $K_{\mathrm{han}} = \operatorname{cl} K_{\mathrm{mom}}$.

As an example illustrating the relation between $K_{\mathrm{mom}}$ and $K_{\mathrm{han}}$, take $k = 2$ and
$z = (1, 0, 0, 0, 1)$. Show that $z \in K_{\mathrm{han}}$, $z \notin K_{\mathrm{mom}}$. Find an explicit sequence of
points in $K_{\mathrm{mom}}$ which converge to $z$.

2.38 [Roc70, pages 15, 61] Convex cones constructed from sets.

(a) The barrier cone of a set $C$ is defined as the set of all vectors $y$ such that $y^T x$ is
bounded above over $x \in C$. In other words, a nonzero vector $y$ is in the barrier cone
if and only if it is the normal vector of a halfspace $\{x \mid y^T x \le \alpha\}$ that contains $C$.
Verify that the barrier cone is a convex cone (with no assumptions on $C$).

(b) The recession cone (also called asymptotic cone) of a set $C$ is defined as the set of
all vectors $y$ such that for each $x \in C$, $x - ty \in C$ for all $t \ge 0$. Show that the
recession cone of a convex set is a convex cone. Show that if $C$ is nonempty, closed,
and convex, then the recession cone of $C$ is the dual of the barrier cone.

(c) The normal cone of a set $C$ at a boundary point $x_0$ is the set of all vectors $y$ such
that $y^T(x - x_0) \le 0$ for all $x \in C$ (i.e., the set of vectors that define a supporting
hyperplane to $C$ at $x_0$). Show that the normal cone is a convex cone (with no
assumptions on $C$). Give a simple description of the normal cone of a polyhedron
$\{x \mid Ax \preceq b\}$ at a point in its boundary.

2.39 Separation of cones. Let $K$ and $\tilde K$ be two convex cones whose interiors are nonempty and
disjoint. Show that there is a nonzero $y$ such that $y \in K^*$, $-y \in \tilde K^*$.


Chapter  3 

Convex  functions 


3.1  Basic  properties  and  examples 

3.1.1  Definition 

A function $f : \mathbf{R}^n \to \mathbf{R}$ is convex if $\operatorname{dom} f$ is a convex set and if for all $x$,
$y \in \operatorname{dom} f$, and $\theta$ with $0 \le \theta \le 1$, we have

$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y). \qquad (3.1)$$

Geometrically, this inequality means that the line segment between $(x, f(x))$ and
$(y, f(y))$, which is the chord from $x$ to $y$, lies above the graph of $f$ (figure 3.1).
A function $f$ is strictly convex if strict inequality holds in (3.1) whenever $x \ne y$
and $0 < \theta < 1$. We say $f$ is concave if $-f$ is convex, and strictly concave if $-f$ is
strictly convex.
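Inequality (3.1) can be probed numerically on a grid of points. The following is a minimal sketch for functions of one variable; the function name, the grid, and the tolerance `tol` are assumptions made here. A sampled test of this kind can refute convexity by exhibiting a violating triple, but finding no violation does not prove convexity.

```python
import math


def violates_convexity(f, points, thetas, tol=1e-9):
    """Return a triple (x, y, theta) violating inequality (3.1), or None if none is found."""
    for x in points:
        for y in points:
            for th in thetas:
                lhs = f(th * x + (1 - th) * y)
                rhs = th * f(x) + (1 - th) * f(y)
                if lhs > rhs + tol:
                    return (x, y, th)
    return None


points = [i / 4 for i in range(-12, 13)]   # grid on [-3, 3]
thetas = [j / 10 for j in range(11)]       # theta in {0, 0.1, ..., 1}

# x^2 is convex, so no violation should turn up on the grid.
square_counterexample = violates_convexity(lambda x: x * x, points, thetas)

# sin is not convex on [-3, 3] (it is concave on [0, pi]), so a violation exists.
sine_counterexample = violates_convexity(math.sin, points, thetas)
```

The tolerance guards against rounding error flagging a convex function by mistake.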

For  an  affine  function  we  always  have  equality  in  (3.1),  so  all  affine  (and  therefore 
also  linear)  functions  are  both  convex  and  concave.  Conversely,  any  function  that 
is  convex  and  concave  is  affine. 

A function is convex if and only if it is convex when restricted to any line that
intersects its domain. In other words $f$ is convex if and only if for all $x \in \operatorname{dom} f$ and
all $v$, the function $g(t) = f(x + tv)$ is convex (on its domain, $\{t \mid x + tv \in \operatorname{dom} f\}$).
This property is very useful, since it allows us to check whether a function is convex
by restricting it to a line.

Figure 3.1 Graph of a convex function. The chord (i.e., line segment) between
any two points on the graph lies above the graph.
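The restriction to a line is easy to set up in code. In this sketch the test function $f(z) = z_1 z_2$, the base point, and the use of a second difference as a convexity probe are all choices made for illustration. The midpoint case of (3.1) applied to $g$ says $g(t) \le (g(t-h) + g(t+h))/2$, so a negative second difference $g(t-h) - 2g(t) + g(t+h)$ shows that $g$, and hence $f$, is not convex.

```python
def restrict_to_line(f, x, v):
    """Return g(t) = f(x + t v) for a function f of a list-valued argument."""
    return lambda t: f([xi + t * vi for xi, vi in zip(x, v)])


def second_difference(g, t, h=1.0):
    """g(t - h) - 2 g(t) + g(t + h); nonnegative whenever g is convex."""
    return g(t - h) - 2 * g(t) + g(t + h)


def f(z):
    """Bilinear function f(z) = z_1 z_2, which is not convex on R^2."""
    return z[0] * z[1]


x0 = [0.0, 0.0]

# Along v = (1, 1), g(t) = t^2 looks convex.
along_diag = second_difference(restrict_to_line(f, x0, [1.0, 1.0]), 0.0)

# Along v = (1, -1), g(t) = -t^2 is concave, which rules out convexity of f.
along_anti = second_difference(restrict_to_line(f, x0, [1.0, -1.0]), 0.0)
```

The example also shows why the statement requires every line: one direction alone can make a nonconvex function look convex.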

The  analysis  of  convex  functions  is  a  well  developed  field,  which  we  will  not 
pursue  in  any  depth.  One  simple  result,  for  example,  is  that  a  convex  function  is 
continuous  on  the  relative  interior  of  its  domain;  it  can  have  discontinuities  only 
on  its  relative  boundary. 


3.1.2  Extended-value  extensions 

It is often convenient to extend a convex function to all of $\mathbf{R}^n$ by defining its value
to be $\infty$ outside its domain. If $f$ is convex we define its extended-value extension
$\tilde f : \mathbf{R}^n \to \mathbf{R} \cup \{\infty\}$ by

$$\tilde f(x) = \begin{cases} f(x) & x \in \operatorname{dom} f \\ \infty & x \notin \operatorname{dom} f. \end{cases}$$

The extension $\tilde f$ is defined on all $\mathbf{R}^n$, and takes values in $\mathbf{R} \cup \{\infty\}$. We can recover
the domain of the original function $f$ from the extension $\tilde f$ as $\operatorname{dom} f = \{x \mid \tilde f(x) < \infty\}$.

The extension can simplify notation, since we do not need to explicitly describe
the domain, or add the qualifier 'for all $x \in \operatorname{dom} f$' every time we refer to $f(x)$.
Consider, for example, the basic defining inequality (3.1). In terms of the extension
$\tilde f$, we can express it as: for $0 < \theta < 1$,

$$\tilde f(\theta x + (1-\theta) y) \le \theta \tilde f(x) + (1-\theta) \tilde f(y)$$

for any $x$ and $y$. (For $\theta = 0$ or $\theta = 1$ the inequality always holds.) Of course here we
must interpret the inequality using extended arithmetic and ordering. For $x$ and $y$
both in $\operatorname{dom} f$, this inequality coincides with (3.1); if either is outside $\operatorname{dom} f$, then
the righthand side is $\infty$, and the inequality therefore holds. As another example
of this notational device, suppose $f_1$ and $f_2$ are two convex functions on $\mathbf{R}^n$. The
pointwise sum $f = f_1 + f_2$ is the function with domain $\operatorname{dom} f = \operatorname{dom} f_1 \cap \operatorname{dom} f_2$,
with $f(x) = f_1(x) + f_2(x)$ for any $x \in \operatorname{dom} f$. Using extended-value extensions we
can simply say that for any $x$, $\tilde f(x) = \tilde f_1(x) + \tilde f_2(x)$. In this equation the domain
of $f$ has been automatically defined as $\operatorname{dom} f = \operatorname{dom} f_1 \cap \operatorname{dom} f_2$, since $\tilde f(x) = \infty$
whenever $x \notin \operatorname{dom} f_1$ or $x \notin \operatorname{dom} f_2$. In this example we are relying on extended
arithmetic to automatically define the domain.

In this book we will use the same symbol to denote a convex function and its
extension, whenever there is no harm from the ambiguity. This is the same as
assuming that all convex functions are implicitly extended, i.e., are defined as $\infty$
outside their domains.


Example 3.1 Indicator function of a convex set. Let $C \subseteq \mathbf{R}^n$ be a convex set, and
consider the (convex) function $I_C$ with domain $C$ and $I_C(x) = 0$ for all $x \in C$. In
other words, the function is identically zero on the set $C$. Its extended-value extension
is given by

$$\tilde I_C(x) = \begin{cases} 0 & x \in C \\ \infty & x \notin C. \end{cases}$$

The convex function $\tilde I_C$ is called the indicator function of the set $C$.

We can play several notational tricks with the indicator function $\tilde I_C$. For example
the problem of minimizing a function $f$ (defined on all of $\mathbf{R}^n$, say) on the set $C$ is the
same as minimizing the function $f + \tilde I_C$ over all of $\mathbf{R}^n$. Indeed, the function $f + \tilde I_C$
is (by our convention) $f$ restricted to the set $C$.

Figure 3.2 If $f$ is convex and differentiable, then $f(x) + \nabla f(x)^T (y - x) \le f(y)$
for all $x, y \in \operatorname{dom} f$.


In a similar way we can extend a concave function by defining it to be $-\infty$
outside its domain.
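
Extended arithmetic of this kind is available directly in floating point, which makes the convention easy to try out. The sketch below is only an illustration: the helper names `extend` and `indicator`, the two sample functions, and the grid search are assumptions made for the example.

```python
import math

def extend(f, in_domain):
    """Extended-value extension: equals f on its domain and +inf elsewhere."""
    return lambda x: f(x) if in_domain(x) else math.inf

def indicator(in_set):
    """Indicator function of a set C: 0 on C and +inf outside C."""
    return lambda x: 0.0 if in_set(x) else math.inf

# f1(x) = -log x with dom f1 = {x > 0};  f2(x) = 1/(1 - x) with dom f2 = {x < 1}
f1 = extend(lambda x: -math.log(x), lambda x: x > 0)
f2 = extend(lambda x: 1.0 / (1.0 - x), lambda x: x < 1)
f_sum = lambda x: f1(x) + f2(x)   # finite exactly on the intersection (0, 1)

# Minimizing x^2 over C = [1, 2] is the same as minimizing x^2 + I_C(x) over R
# (here approximated by a search over a grid on [-3, 3]).
I_C = indicator(lambda x: 1.0 <= x <= 2.0)
grid = [i / 100.0 for i in range(-300, 301)]
x_best = min(grid, key=lambda x: x * x + I_C(x))
```

The domain of the sum is never written down; it emerges from the rule that adding $\infty$ to a finite number gives $\infty$.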


3.1.3  First-order  conditions 

Suppose $f$ is differentiable (i.e., its gradient $\nabla f$ exists at each point in $\operatorname{dom} f$,
which is open). Then $f$ is convex if and only if $\operatorname{dom} f$ is convex and

$$f(y) \ge f(x) + \nabla f(x)^T (y - x) \qquad (3.2)$$

holds for all $x, y \in \operatorname{dom} f$. This inequality is illustrated in figure 3.2.

The affine function of $y$ given by $f(x) + \nabla f(x)^T (y - x)$ is, of course, the first-order
Taylor approximation of $f$ near $x$. The inequality (3.2) states that for a convex
function, the first-order Taylor approximation is in fact a global underestimator of
the function. Conversely, if the first-order Taylor approximation of a function is
always a global underestimator of the function, then the function is convex.

The inequality (3.2) shows that from local information about a convex function
(i.e., its value and derivative at a point) we can derive global information (i.e., a
global underestimator of it). This is perhaps the most important property of convex
functions, and explains some of the remarkable properties of convex functions and
convex optimization problems. As one simple example, the inequality (3.2) shows
that if $\nabla f(x) = 0$, then for all $y \in \operatorname{dom} f$, $f(y) \ge f(x)$, i.e., $x$ is a global minimizer
of the function $f$.

Strict convexity can also be characterized by a first-order condition: $f$ is strictly
convex if and only if $\operatorname{dom} f$ is convex and for $x, y \in \operatorname{dom} f$, $x \ne y$, we have

$$f(y) > f(x) + \nabla f(x)^T (y - x). \qquad (3.3)$$

For concave functions we have the corresponding characterization: $f$ is concave
if and only if $\operatorname{dom} f$ is convex and

$$f(y) \le f(x) + \nabla f(x)^T (y - x)$$

for all $x, y \in \operatorname{dom} f$.
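
The first-order condition can be checked numerically for functions of one variable. The sketch below is an illustration only: the helper `tangent_gap`, the grid, and the choice of $e^x$ and $\sin x$ as test functions are assumptions made for the example.

```python
import math

def tangent_gap(f, df, x, y):
    """f(y) minus the first-order Taylor approximation of f at x, evaluated at y."""
    return f(y) - (f(x) + df(x) * (y - x))

points = [i / 4.0 for i in range(-12, 13)]   # grid on [-3, 3]

# f(x) = exp(x) is convex: every gap should be nonnegative, as in (3.2).
min_gap_exp = min(tangent_gap(math.exp, math.exp, x, y)
                  for x in points for y in points)

# f(x) = sin(x) is not convex on [-3, 3]: some tangent line overshoots the function.
min_gap_sin = min(tangent_gap(math.sin, math.cos, x, y)
                  for x in points for y in points)
```

A negative gap for some pair $(x, y)$ certifies that the function is not convex; nonnegative gaps on a grid are consistent with convexity but do not prove it.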

Proof  of  first-order  convexity  condition 

To prove (3.2), we first consider the case $n = 1$: We show that a differentiable
function $f : \mathbf{R} \to \mathbf{R}$ is convex if and only if

$$f(y) \ge f(x) + f'(x)(y - x) \qquad (3.4)$$

for all $x$ and $y$ in $\operatorname{dom} f$.

Assume first that $f$ is convex and $x, y \in \operatorname{dom} f$. Since $\operatorname{dom} f$ is convex (i.e.,
an interval), we conclude that for all $0 < t \le 1$, $x + t(y - x) \in \operatorname{dom} f$, and by
convexity of $f$,

$$f(x + t(y - x)) \le (1 - t) f(x) + t f(y).$$

If we rearrange and divide both sides by $t$, we obtain

$$f(y) \ge f(x) + \frac{f(x + t(y - x)) - f(x)}{t},$$

and taking the limit as $t \to 0$ yields (3.4).

To show sufficiency, assume the function satisfies (3.4) for all $x$ and $y$ in $\operatorname{dom} f$
(which is an interval). Choose any $x \ne y$, and $0 \le \theta \le 1$, and let $z = \theta x + (1 - \theta) y$.
Applying (3.4) twice yields

$$f(x) \ge f(z) + f'(z)(x - z), \qquad f(y) \ge f(z) + f'(z)(y - z).$$

Multiplying the first inequality by $\theta$, the second by $1 - \theta$, and adding them yields

$$\theta f(x) + (1 - \theta) f(y) \ge f(z),$$

which proves that $f$ is convex.

Now we can prove the general case, with $f : \mathbf{R}^n \to \mathbf{R}$. Let $x, y \in \mathbf{R}^n$ and
consider $f$ restricted to the line passing through them, i.e., the function defined by
$g(t) = f(ty + (1 - t)x)$, so $g'(t) = \nabla f(ty + (1 - t)x)^T (y - x)$.

First assume $f$ is convex, which implies $g$ is convex, so by the argument above
we have $g(1) \ge g(0) + g'(0)$, which means

$$f(y) \ge f(x) + \nabla f(x)^T (y - x).$$

Now assume that this inequality holds for any $x$ and $y$, so if $ty + (1 - t)x \in \operatorname{dom} f$
and $\tilde t y + (1 - \tilde t)x \in \operatorname{dom} f$, we have

$$f(ty + (1 - t)x) \ge f(\tilde t y + (1 - \tilde t)x) + \nabla f(\tilde t y + (1 - \tilde t)x)^T (y - x)(t - \tilde t),$$

i.e., $g(t) \ge g(\tilde t) + g'(\tilde t)(t - \tilde t)$. We have seen that this implies that $g$ is convex.


3.1.4 Second-order conditions

We now assume that $f$ is twice differentiable, that is, its Hessian or second derivative
$\nabla^2 f$ exists at each point in $\operatorname{dom} f$, which is open. Then $f$ is convex if and
only if $\operatorname{dom} f$ is convex and its Hessian is positive semidefinite: for all $x \in \operatorname{dom} f$,

$$\nabla^2 f(x) \succeq 0.$$

For a function on $\mathbf{R}$, this reduces to the simple condition $f''(x) \ge 0$ (and $\operatorname{dom} f$
convex, i.e., an interval), which means that the derivative is nondecreasing. The
condition $\nabla^2 f(x) \succeq 0$ can be interpreted geometrically as the requirement that the
graph of the function have positive (upward) curvature at $x$. We leave the proof
of the second-order condition as an exercise (exercise 3.8).

Similarly, $f$ is concave if and only if $\operatorname{dom} f$ is convex and $\nabla^2 f(x) \preceq 0$ for
all $x \in \operatorname{dom} f$. Strict convexity can be partially characterized by second-order
conditions. If $\nabla^2 f(x) \succ 0$ for all $x \in \operatorname{dom} f$, then $f$ is strictly convex. The
converse, however, is not true: for example, the function $f : \mathbf{R} \to \mathbf{R}$ given by
$f(x) = x^4$ is strictly convex but has zero second derivative at $x = 0$.


Example 3.2 Quadratic functions. Consider the quadratic function $f : \mathbf{R}^n \to \mathbf{R}$, with
$\operatorname{dom} f = \mathbf{R}^n$, given by

$$f(x) = (1/2) x^T P x + q^T x + r,$$

with $P \in \mathbf{S}^n$, $q \in \mathbf{R}^n$, and $r \in \mathbf{R}$. Since $\nabla^2 f(x) = P$ for all $x$, $f$ is convex if and only
if $P \succeq 0$ (and concave if and only if $P \preceq 0$).

For quadratic functions, strict convexity is easily characterized: $f$ is strictly convex
if and only if $P \succ 0$ (and strictly concave if and only if $P \prec 0$).


Remark 3.1 The separate requirement that $\operatorname{dom} f$ be convex cannot be dropped from
the first- or second-order characterizations of convexity and concavity. For example,
the function $f(x) = 1/x^2$, with $\operatorname{dom} f = \{x \in \mathbf{R} \mid x \ne 0\}$, satisfies $f''(x) > 0$ for all
$x \in \operatorname{dom} f$, but is not a convex function.
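
The counterexample in remark 3.1 can be made concrete with one violated instance of (3.1). In the sketch below, assigning the value $\infty$ at $x = 0$ follows the extended-value convention of §3.1.2; the function names and the particular points are assumptions chosen for the illustration.

```python
import math

def f(x):
    """f(x) = 1/x^2, with the value +inf assigned at x = 0 (outside the domain)."""
    return 1.0 / (x * x) if x != 0 else math.inf

def second_derivative(x):
    """f''(x) = 6/x^4, which is positive at every point of the domain."""
    return 6.0 / x ** 4

x, y, theta = -1.0, 1.0, 0.5
lhs = f(theta * x + (1 - theta) * y)      # f(0) = +inf
rhs = theta * f(x) + (1 - theta) * f(y)   # 0.5 * 1 + 0.5 * 1 = 1
violates = lhs > rhs
```

The two endpoints lie on opposite sides of the gap in the domain, so the chord between them passes over a point where the function is not defined at all.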


3.1.5  Examples 

We  have  already  mentioned  that  all  linear  and  affine  functions  are  convex  (and 
concave),  and  have  described  the  convex  and  concave  quadratic  functions.  In  this 
section  we  give  a  few  more  examples  of  convex  and  concave  functions.  We  start 
with  some  functions  on  R,  with  variable  x. 

• Exponential. $e^{ax}$ is convex on $\mathbf{R}$, for any $a \in \mathbf{R}$.

• Powers. $x^a$ is convex on $\mathbf{R}_{++}$ when $a \ge 1$ or $a \le 0$, and concave for $0 \le a \le 1$.

• Powers of absolute value. $|x|^p$, for $p \ge 1$, is convex on $\mathbf{R}$.

• Logarithm. $\log x$ is concave on $\mathbf{R}_{++}$.

• Negative entropy. $x \log x$ (either on $\mathbf{R}_{++}$, or on $\mathbf{R}_+$, defined as $0$ for $x = 0$)
is convex.

Convexity or concavity of these examples can be shown by verifying the basic
inequality (3.1), or by checking that the second derivative is nonnegative or
nonpositive. For example, with $f(x) = x \log x$ we have

$$f'(x) = \log x + 1, \qquad f''(x) = 1/x,$$

so that $f''(x) > 0$ for $x > 0$. This shows that the negative entropy function is
(strictly) convex.

We now give a few interesting examples of functions on $\mathbf{R}^n$.

• Norms. Every norm on $\mathbf{R}^n$ is convex.

• Max function. $f(x) = \max\{x_1, \ldots, x_n\}$ is convex on $\mathbf{R}^n$.

• Quadratic-over-linear function. The function $f(x, y) = x^2/y$, with

$$\operatorname{dom} f = \mathbf{R} \times \mathbf{R}_{++} = \{(x, y) \in \mathbf{R}^2 \mid y > 0\},$$

is convex (figure 3.3).

• Log-sum-exp. The function $f(x) = \log(e^{x_1} + \cdots + e^{x_n})$ is convex on $\mathbf{R}^n$.
This function can be interpreted as a differentiable (in fact, analytic) approximation
of the max function, since

$$\max\{x_1, \ldots, x_n\} \le f(x) \le \max\{x_1, \ldots, x_n\} + \log n$$

for all $x$. (The second inequality is tight when all components of $x$ are equal.)
Figure 3.4 shows $f$ for $n = 2$.

• Geometric mean. The geometric mean $f(x) = \left(\prod_{i=1}^n x_i\right)^{1/n}$ is concave on
$\operatorname{dom} f = \mathbf{R}^n_{++}$.

• Log-determinant. The function $f(X) = \log \det X$ is concave on $\operatorname{dom} f = \mathbf{S}^n_{++}$.

Convexity  (or  concavity)  of  these  examples  can  be  verified  in  several  ways, 
such  as  directly  verifying  the  inequality  (3.1),  verifying  that  the  Hessian  is  positive 
semidefinite,  or  restricting  the  function  to  an  arbitrary  line  and  verifying  convexity 
of  the  resulting  function  of  one  variable. 
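
The bounds relating log-sum-exp to the max function are easy to confirm numerically. The sketch below is an illustration only; subtracting the maximum before exponentiating is a common numerical safeguard against overflow and is an implementation choice made here, not something the text prescribes.

```python
import math

def log_sum_exp(x):
    """log(exp(x_1) + ... + exp(x_n)), shifted by max(x) to avoid overflow."""
    m = max(x)
    return m + math.log(sum(math.exp(xi - m) for xi in x))

x = [1.0, -2.0, 0.5, 3.0]
lower, value, upper = max(x), log_sum_exp(x), max(x) + math.log(len(x))

# With all components equal, the upper bound max + log n is attained.
equal = [2.0, 2.0, 2.0, 2.0]
tight_gap = (max(equal) + math.log(len(equal))) - log_sum_exp(equal)
```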

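Norms. If $f : \mathbf{R}^n \to \mathbf{R}$ is a norm, and $0 \le \theta \le 1$, then

$$f(\theta x + (1 - \theta) y) \le f(\theta x) + f((1 - \theta) y) = \theta f(x) + (1 - \theta) f(y).$$

The inequality follows from the triangle inequality, and the equality follows from
homogeneity of a norm.

Max function. The function $f(x) = \max_i x_i$ satisfies, for $0 \le \theta \le 1$,

$$\begin{aligned}
f(\theta x + (1 - \theta) y) &= \max_i \left(\theta x_i + (1 - \theta) y_i\right) \\
&\le \theta \max_i x_i + (1 - \theta) \max_i y_i \\
&= \theta f(x) + (1 - \theta) f(y).
\end{aligned}$$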


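Quadratic-over-linear function. To show that the quadratic-over-linear function
$f(x, y) = x^2/y$ is convex, we note that (for $y > 0$),

$$\nabla^2 f(x, y) = \frac{2}{y^3} \begin{bmatrix} y^2 & -xy \\ -xy & x^2 \end{bmatrix} = \frac{2}{y^3} \begin{bmatrix} y \\ -x \end{bmatrix} \begin{bmatrix} y \\ -x \end{bmatrix}^T \succeq 0.$$

Log-sum-exp. The Hessian of the log-sum-exp function is

$$\nabla^2 f(x) = \frac{1}{(\mathbf{1}^T z)^2} \left( (\mathbf{1}^T z) \operatorname{diag}(z) - z z^T \right),$$

where $z = (e^{x_1}, \ldots, e^{x_n})$. To verify that $\nabla^2 f(x) \succeq 0$ we must show that for all $v$,
$v^T \nabla^2 f(x) v \ge 0$, i.e.,

$$v^T \nabla^2 f(x) v = \frac{1}{(\mathbf{1}^T z)^2} \left( \left( \sum_{i=1}^n z_i \right) \left( \sum_{i=1}^n v_i^2 z_i \right) - \left( \sum_{i=1}^n v_i z_i \right)^2 \right) \ge 0.$$

But this follows from the Cauchy-Schwarz inequality $(a^T a)(b^T b) \ge (a^T b)^2$ applied
to the vectors with components $a_i = v_i \sqrt{z_i}$, $b_i = \sqrt{z_i}$.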


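Geometric mean. In a similar way we can show that the geometric mean $f(x) =
\left(\prod_{i=1}^n x_i\right)^{1/n}$ is concave on $\operatorname{dom} f = \mathbf{R}^n_{++}$. Its Hessian $\nabla^2 f(x)$ is given by

$$\frac{\partial^2 f(x)}{\partial x_k^2} = -(n - 1) \frac{\left(\prod_{i=1}^n x_i\right)^{1/n}}{n^2 x_k^2}, \qquad
\frac{\partial^2 f(x)}{\partial x_k \partial x_l} = \frac{\left(\prod_{i=1}^n x_i\right)^{1/n}}{n^2 x_k x_l} \quad \text{for } k \ne l,$$

and can be expressed as

$$\nabla^2 f(x) = -\frac{\prod_{i=1}^n x_i^{1/n}}{n^2} \left( n \operatorname{diag}(1/x_1^2, \ldots, 1/x_n^2) - q q^T \right)$$

where $q_i = 1/x_i$. We must show that $\nabla^2 f(x) \preceq 0$, i.e., that

$$v^T \nabla^2 f(x) v = -\frac{\prod_{i=1}^n x_i^{1/n}}{n^2} \left( n \sum_{i=1}^n \frac{v_i^2}{x_i^2} - \left( \sum_{i=1}^n \frac{v_i}{x_i} \right)^2 \right) \le 0$$

for all $v$. Again this follows from the Cauchy-Schwarz inequality $(a^T a)(b^T b) \ge
(a^T b)^2$, applied to the vectors $a = \mathbf{1}$ and $b_i = v_i / x_i$.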


Log-determinant. For the function $f(X) = \log \det X$, we can verify concavity by
considering an arbitrary line, given by $X = Z + tV$, where $Z, V \in \mathbf{S}^n$. We define
$g(t) = f(Z + tV)$, and restrict $g$ to the interval of values of $t$ for which $Z + tV \succ 0$.
Without loss of generality, we can assume that $t = 0$ is inside this interval, i.e.,
$Z \succ 0$. We have

$$\begin{aligned}
g(t) &= \log \det (Z + tV) \\
&= \log \det \left( Z^{1/2} (I + t Z^{-1/2} V Z^{-1/2}) Z^{1/2} \right) \\
&= \sum_{i=1}^n \log(1 + t \lambda_i) + \log \det Z
\end{aligned}$$

where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $Z^{-1/2} V Z^{-1/2}$. Therefore we have

$$g'(t) = \sum_{i=1}^n \frac{\lambda_i}{1 + t \lambda_i}, \qquad g''(t) = -\sum_{i=1}^n \frac{\lambda_i^2}{(1 + t \lambda_i)^2}.$$

Since $g''(t) \le 0$, we conclude that $f$ is concave.
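
For $2 \times 2$ matrices the restriction of $\log \det$ to a line can be evaluated by hand, which gives a quick numerical check that $g''(t) \le 0$. The sketch below is an illustration only: the matrices $Z$ and $V$, the interval of $t$, and the use of a central finite difference in place of the exact second derivative are assumptions made for the example.

```python
import math

def logdet2(a, b, c):
    """log det of the symmetric 2x2 matrix [[a, b], [b, c]] (must be positive definite)."""
    det = a * c - b * b
    if a <= 0 or det <= 0:
        raise ValueError("matrix is not positive definite")
    return math.log(det)

Z = (2.0, 0.5, 1.0)    # entries (a, b, c) of a positive definite Z
V = (1.0, -1.0, 0.3)   # an arbitrary symmetric direction

def g(t):
    """g(t) = log det(Z + tV)."""
    return logdet2(*(z + t * v for z, v in zip(Z, V)))

ts = [i / 20.0 for i in range(-6, 7)]   # Z + tV stays positive definite on [-0.3, 0.3]
h = 1e-3
# Central finite-difference approximation of g''(t) at each grid point.
second_differences = [(g(t + h) - 2 * g(t) + g(t - h)) / h ** 2 for t in ts]
```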


3.1.6 Sublevel sets


The $\alpha$-sublevel set of a function $f : \mathbf{R}^n \to \mathbf{R}$ is defined as

$$C_\alpha = \{x \in \operatorname{dom} f \mid f(x) \le \alpha\}.$$

Sublevel sets of a convex function are convex, for any value of $\alpha$. The proof is
immediate from the definition of convexity: if $x, y \in C_\alpha$, then $f(x) \le \alpha$ and
$f(y) \le \alpha$, and so $f(\theta x + (1 - \theta) y) \le \alpha$ for $0 \le \theta \le 1$, and hence $\theta x + (1 - \theta) y \in C_\alpha$.

The converse is not true: a function can have all its sublevel sets convex, but
not be a convex function. For example, $f(x) = -e^x$ is not convex on $\mathbf{R}$ (indeed, it
is strictly concave) but all its sublevel sets are convex.

If $f$ is concave, then its $\alpha$-superlevel set, given by $\{x \in \operatorname{dom} f \mid f(x) \ge \alpha\}$, is a
convex set. The sublevel set property is often a good way to establish convexity of
a set, by expressing it as a sublevel set of a convex function, or as the superlevel
set of a concave function.




Example 3.3 The geometric and arithmetic means of $x \in \mathbf{R}^n_+$ are, respectively,

$$G(x) = \left( \prod_{i=1}^n x_i \right)^{1/n}, \qquad A(x) = \frac{1}{n} \sum_{i=1}^n x_i.$$

The arithmetic-geometric mean inequality states that $G(x) \le A(x)$.

Suppose $0 \le \alpha \le 1$, and consider the set

$$\{x \in \mathbf{R}^n_+ \mid G(x) \ge \alpha A(x)\},$$

i.e., the set of vectors with geometric mean at least as large as a factor $\alpha$ times the
arithmetic mean. This set is convex, since it is the 0-superlevel set of the function
$G(x) - \alpha A(x)$, which is concave. In fact, the set is positively homogeneous, so it is a
convex cone.


3.1.7  Epigraph 

The graph of a function $f : \mathbf{R}^n \to \mathbf{R}$ is defined as

$$\{(x, f(x)) \mid x \in \operatorname{dom} f\},$$

which is a subset of $\mathbf{R}^{n+1}$. The epigraph of a function $f : \mathbf{R}^n \to \mathbf{R}$ is defined as

$$\operatorname{epi} f = \{(x, t) \mid x \in \operatorname{dom} f, \ f(x) \le t\},$$

which is a subset of $\mathbf{R}^{n+1}$. ('Epi' means 'above' so epigraph means 'above the
graph'.) The definition is illustrated in figure 3.5.

The link between convex sets and convex functions is via the epigraph: A
function is convex if and only if its epigraph is a convex set. A function is concave
if and only if its hypograph, defined as

$$\operatorname{hypo} f = \{(x, t) \mid t \le f(x)\},$$

is a convex set.

Figure 3.5 Epigraph of a function $f$, shown shaded. The lower boundary,
shown darker, is the graph of $f$.


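Example 3.4 Matrix fractional function. The function $f : \mathbf{R}^n \times \mathbf{S}^n \to \mathbf{R}$, defined as

$$f(x, Y) = x^T Y^{-1} x$$

is convex on $\operatorname{dom} f = \mathbf{R}^n \times \mathbf{S}^n_{++}$. (This generalizes the quadratic-over-linear function
$f(x, y) = x^2/y$, with $\operatorname{dom} f = \mathbf{R} \times \mathbf{R}_{++}$.)

One easy way to establish convexity of $f$ is via its epigraph:

$$\begin{aligned}
\operatorname{epi} f &= \{(x, Y, t) \mid Y \succ 0, \ x^T Y^{-1} x \le t\} \\
&= \left\{ (x, Y, t) \;\middle|\; \begin{bmatrix} Y & x \\ x^T & t \end{bmatrix} \succeq 0, \ Y \succ 0 \right\},
\end{aligned}$$

using the Schur complement condition for positive semidefiniteness of a block matrix
(see §A.5.5). The last condition is a linear matrix inequality in $(x, Y, t)$, and therefore
$\operatorname{epi} f$ is convex.

For the special case $n = 1$, the matrix fractional function reduces to the
quadratic-over-linear function $x^2/y$, and the associated LMI representation is

$$\operatorname{epi} f = \left\{ (x, y, t) \;\middle|\; \begin{bmatrix} y & x \\ x & t \end{bmatrix} \succeq 0, \ y > 0 \right\}$$

(the graph of which is shown in figure 3.3).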
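
For $n = 1$ the equivalence between the two descriptions of the epigraph can be tested point by point. The sketch below is an illustration only; it uses the elementary fact that a symmetric $2 \times 2$ matrix is positive semidefinite exactly when its diagonal entries and its determinant are nonnegative, and the sample grid is an assumption chosen so that all arithmetic is exact in floating point.

```python
def in_epigraph(x, y, t):
    """Direct test: y > 0 and x^2 / y <= t."""
    return y > 0 and x * x / y <= t

def satisfies_lmi(x, y, t):
    """LMI test: [[y, x], [x, t]] is positive semidefinite and y > 0.

    A symmetric 2x2 matrix is positive semidefinite exactly when both diagonal
    entries and the determinant are nonnegative.
    """
    return y > 0 and t >= 0 and y * t - x * x >= 0

# Half-integer grid that includes points outside the domain (y <= 0).
samples = [(x / 2.0, y / 2.0, t / 2.0)
           for x in range(-4, 5) for y in range(-2, 7) for t in range(-2, 9)]
mismatches = [p for p in samples if in_epigraph(*p) != satisfies_lmi(*p)]
```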


Many results for convex functions can be proved (or interpreted) geometrically
using epigraphs, and applying results for convex sets. As an example, consider the
first-order condition for convexity:

$$f(y) \ge f(x) + \nabla f(x)^T (y - x),$$

where $f$ is convex and $x, y \in \operatorname{dom} f$. We can interpret this basic inequality
geometrically in terms of $\operatorname{epi} f$. If $(y, t) \in \operatorname{epi} f$, then

$$t \ge f(y) \ge f(x) + \nabla f(x)^T (y - x).$$

We can express this as:

$$(y, t) \in \operatorname{epi} f \implies \begin{bmatrix} \nabla f(x) \\ -1 \end{bmatrix}^T \left( \begin{bmatrix} y \\ t \end{bmatrix} - \begin{bmatrix} x \\ f(x) \end{bmatrix} \right) \le 0.$$

This means that the hyperplane defined by $(\nabla f(x), -1)$ supports $\operatorname{epi} f$ at the
boundary point $(x, f(x))$; see figure 3.6.

Figure 3.6 For a differentiable convex function $f$, the vector $(\nabla f(x), -1)$
defines a supporting hyperplane to the epigraph of $f$ at $x$.


3.1.8  Jensen’s  inequality  and  extensions 

The basic inequality (3.1), i.e.,

$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y),$$

is sometimes called Jensen's inequality. It is easily extended to convex combinations
of more than two points: If $f$ is convex, $x_1, \ldots, x_k \in \operatorname{dom} f$, and $\theta_1, \ldots, \theta_k \ge 0$
with $\theta_1 + \cdots + \theta_k = 1$, then

$$f(\theta_1 x_1 + \cdots + \theta_k x_k) \le \theta_1 f(x_1) + \cdots + \theta_k f(x_k).$$

As in the case of convex sets, the inequality extends to infinite sums, integrals, and
expected values. For example, if $p(x) \ge 0$ on $S \subseteq \operatorname{dom} f$, $\int_S p(x) \, dx = 1$, then

$$f\left( \int_S p(x) \, x \, dx \right) \le \int_S f(x) \, p(x) \, dx,$$

provided the integrals exist. In the most general case we can take any probability
measure with support in $\operatorname{dom} f$. If $x$ is a random variable such that $x \in \operatorname{dom} f$
with probability one, and $f$ is convex, then we have

$$f(\mathbf{E}\, x) \le \mathbf{E}\, f(x), \qquad (3.5)$$

provided the expectations exist. We can recover the basic inequality (3.1) from
this general form, by taking the random variable $x$ to have support $\{x_1, x_2\}$, with
$\mathbf{prob}(x = x_1) = \theta$, $\mathbf{prob}(x = x_2) = 1 - \theta$. Thus the inequality (3.5) characterizes
convexity: If $f$ is not convex, there is a random variable $x$, with $x \in \operatorname{dom} f$ with
probability one, such that $f(\mathbf{E}\, x) > \mathbf{E}\, f(x)$.

All of these inequalities are now called Jensen's inequality, even though the
inequality studied by Jensen was the very simple one

$$f\left( \frac{x + y}{2} \right) \le \frac{f(x) + f(y)}{2}.$$


Remark 3.2 We can interpret (3.5) as follows. Suppose $x \in \operatorname{dom} f \subseteq \mathbf{R}^n$ and $z$ is
any zero mean random vector in $\mathbf{R}^n$. Then we have

$$\mathbf{E}\, f(x + z) \ge f(x).$$

Thus, randomization or dithering (i.e., adding a zero mean random vector to the
argument) cannot decrease the value of a convex function on average.
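
For a random variable with finitely many values, both sides of (3.5) are finite sums and can be compared directly. The sketch below is an illustration only: the distribution, the choice $f(x) = e^x$, and the two-point dither $z = \pm 1$ are assumptions made for the example.

```python
import math

def expectation(values, probs, f=lambda v: v):
    """E f(x) for a discrete random variable taking values[i] with probability probs[i]."""
    return sum(p * f(v) for v, p in zip(values, probs))

values = [-1.0, 0.5, 2.0, 4.0]
probs = [0.1, 0.4, 0.3, 0.2]

f_of_mean = math.exp(expectation(values, probs))   # f(E x)
mean_of_f = expectation(values, probs, math.exp)   # E f(x)

# Dithering (remark 3.2): z = +1 or -1 with equal probability has zero mean.
x0 = 0.7
dithered = 0.5 * math.exp(x0 + 1.0) + 0.5 * math.exp(x0 - 1.0)
```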


3.1.9  Inequalities 


Many famous inequalities can be derived by applying Jensen's inequality to some
appropriate convex function. (Indeed, convexity and Jensen's inequality can be
made the foundation of a theory of inequalities.) As a simple example, consider
the arithmetic-geometric mean inequality:

$$\sqrt{ab} \le (a + b)/2 \qquad (3.6)$$

for $a, b \ge 0$. The function $-\log x$ is convex; Jensen's inequality with $\theta = 1/2$ yields

$$-\log\left( \frac{a + b}{2} \right) \le \frac{-\log a - \log b}{2}.$$

Taking the exponential of both sides yields (3.6).

As a less trivial example we prove Hölder's inequality: for $p > 1$, $1/p + 1/q = 1$,
and $x, y \in \mathbf{R}^n$,

$$\sum_{i=1}^n x_i y_i \le \left( \sum_{i=1}^n |x_i|^p \right)^{1/p} \left( \sum_{i=1}^n |y_i|^q \right)^{1/q}.$$

By convexity of $-\log x$, and Jensen's inequality with general $\theta$, we obtain the more
general arithmetic-geometric mean inequality

$$a^\theta b^{1 - \theta} \le \theta a + (1 - \theta) b,$$


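valid for $a, b \ge 0$ and $0 \le \theta \le 1$. Applying this with

$$a = \frac{|x_i|^p}{\sum_{j=1}^n |x_j|^p}, \qquad b = \frac{|y_i|^q}{\sum_{j=1}^n |y_j|^q}, \qquad \theta = 1/p,$$

(for nonzero $x$ and $y$) yields

$$\left( \frac{|x_i|^p}{\sum_{j=1}^n |x_j|^p} \right)^{1/p} \left( \frac{|y_i|^q}{\sum_{j=1}^n |y_j|^q} \right)^{1/q} \le \frac{|x_i|^p}{p \sum_{j=1}^n |x_j|^p} + \frac{|y_i|^q}{q \sum_{j=1}^n |y_j|^q}.$$

Summing over $i$ gives a righthand side equal to $1/p + 1/q = 1$, which yields Hölder's inequality.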


3.2 Operations that preserve convexity

In  this  section  we  describe  some  operations  that  preserve  convexity  or  concavity 
of  functions,  or  allow  us  to  construct  new  convex  and  concave  functions.  We  start 
with  some  simple  operations  such  as  addition,  scaling,  and  pointwise  supremum, 
and  then  describe  some  more  sophisticated  operations  (some  of  which  include  the 
simple  operations  as  special  cases). 


3.2.1  Nonnegative  weighted  sums 


Evidently if $f$ is a convex function and $\alpha \ge 0$, then the function $\alpha f$ is convex.
If $f_1$ and $f_2$ are both convex functions, then so is their sum $f_1 + f_2$. Combining
nonnegative scaling and addition, we see that the set of convex functions is itself a
convex cone: a nonnegative weighted sum of convex functions,

$$f = w_1 f_1 + \cdots + w_m f_m,$$

is convex. Similarly, a nonnegative weighted sum of concave functions is concave. A
nonnegative, nonzero weighted sum of strictly convex (concave) functions is strictly
convex (concave).

These properties extend to infinite sums and integrals. For example if $f(x, y)$
is convex in $x$ for each $y \in \mathcal{A}$, and $w(y) \ge 0$ for each $y \in \mathcal{A}$, then the function $g$
defined as

$$g(x) = \int_{\mathcal{A}} w(y) f(x, y) \, dy$$

is convex in $x$ (provided the integral exists).

The fact that convexity is preserved under nonnegative scaling and addition is
easily verified directly, or can be seen in terms of the associated epigraphs. For
example, if $w > 0$ and $f$ is convex, we have

$$\operatorname{epi}(wf) = \begin{bmatrix} I & 0 \\ 0 & w \end{bmatrix} \operatorname{epi} f,$$

which is convex because the image of a convex set under a linear mapping is convex.


3.2.2  Composition  with  an  affine  mapping 

Suppose $f : \mathbf{R}^n \to \mathbf{R}$, $A \in \mathbf{R}^{n \times m}$, and $b \in \mathbf{R}^n$. Define $g : \mathbf{R}^m \to \mathbf{R}$ by

$$g(x) = f(Ax + b),$$

with $\operatorname{dom} g = \{x \mid Ax + b \in \operatorname{dom} f\}$. Then if $f$ is convex, so is $g$; if $f$ is concave,
so is $g$.


3.2.3 Pointwise maximum and supremum

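If $f_1$ and $f_2$ are convex functions then their pointwise maximum $f$, defined by

$$f(x) = \max\{f_1(x), f_2(x)\},$$

with $\operatorname{dom} f = \operatorname{dom} f_1 \cap \operatorname{dom} f_2$, is also convex. This property is easily verified: if
$0 \le \theta \le 1$ and $x, y \in \operatorname{dom} f$, then

$$\begin{aligned}
f(\theta x + (1 - \theta) y) &= \max\{f_1(\theta x + (1 - \theta) y), \ f_2(\theta x + (1 - \theta) y)\} \\
&\le \max\{\theta f_1(x) + (1 - \theta) f_1(y), \ \theta f_2(x) + (1 - \theta) f_2(y)\} \\
&\le \theta \max\{f_1(x), f_2(x)\} + (1 - \theta) \max\{f_1(y), f_2(y)\} \\
&= \theta f(x) + (1 - \theta) f(y),
\end{aligned}$$

which establishes convexity of $f$. It is easily shown that if $f_1, \ldots, f_m$ are convex,
then their pointwise maximum

$$f(x) = \max\{f_1(x), \ldots, f_m(x)\}$$

is also convex.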


Example 3.5 Piecewise-linear functions. The function

$$f(x) = \max\{a_1^T x + b_1, \ldots, a_L^T x + b_L\}$$

defines a piecewise-linear (or really, affine) function (with $L$ or fewer regions). It is
convex since it is the pointwise maximum of affine functions.

The converse can also be shown: any piecewise-linear convex function with $L$ or fewer
regions can be expressed in this form. (See exercise 3.29.)
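
A maximum of affine functions is simple to build and test. The sketch below restricts to a scalar variable for brevity (the example above allows $x \in \mathbf{R}^n$); the helper names, the two sample functions, and the midpoint test on a grid are assumptions made for the illustration.

```python
def piecewise_linear(pieces):
    """Return f(x) = max_i (a_i * x + b_i) for scalar x, given pieces = [(a_i, b_i), ...]."""
    return lambda x: max(a * x + b for a, b in pieces)

# |x| = max{x, -x}; the hinge max{0, 1 - x} is another instance.
abs_value = piecewise_linear([(1.0, 0.0), (-1.0, 0.0)])
hinge = piecewise_linear([(0.0, 0.0), (-1.0, 1.0)])

def midpoint_convex(f, points):
    """Check f((x + y)/2) <= (f(x) + f(y))/2 for all pairs drawn from points."""
    return all(f((x + y) / 2) <= (f(x) + f(y)) / 2 + 1e-12
               for x in points for y in points)

grid = [i / 4.0 for i in range(-8, 9)]
```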


Example 3.6 Sum of $r$ largest components. For $x \in \mathbf{R}^n$ we denote by $x_{[i]}$ the $i$th
largest component of $x$, i.e.,

$$x_{[1]} \ge x_{[2]} \ge \cdots \ge x_{[n]}$$

are the components of $x$ sorted in nonincreasing order. Then the function

$$f(x) = \sum_{i=1}^r x_{[i]},$$

i.e., the sum of the $r$ largest elements of $x$, is a convex function. This can be seen by
writing it as

$$f(x) = \sum_{i=1}^r x_{[i]} = \max\{x_{i_1} + \cdots + x_{i_r} \mid 1 \le i_1 < i_2 < \cdots < i_r \le n\},$$

i.e., the maximum of all possible sums of $r$ different components of $x$. Since it is the
pointwise maximum of $n!/(r!(n - r)!)$ linear functions, it is convex.

As an extension it can be shown that the function $\sum_{i=1}^r w_i x_{[i]}$ is convex, provided
$w_1 \ge w_2 \ge \cdots \ge w_r \ge 0$. (See exercise 3.19.)
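
The two descriptions of the sum of the $r$ largest components can be compared directly on small vectors. The sketch below is an illustration only; the function names and test vectors are assumptions, and enumerating all subsets is practical only for small $n$.

```python
from itertools import combinations

def sum_r_largest(x, r):
    """Sum of the r largest components of x, computed by sorting."""
    return sum(sorted(x, reverse=True)[:r])

def max_over_subsets(x, r):
    """The same quantity written as a maximum of C(n, r) linear functions of x."""
    return max(sum(x[i] for i in idx) for idx in combinations(range(len(x)), r))

x = [3.0, -1.0, 4.0, 1.5, -2.0]
y = [0.0, 2.0, -3.0, 5.0, 1.0]
r = 3
mid = [(a + b) / 2 for a, b in zip(x, y)]   # midpoint of x and y
```

Sorting is the efficient way to evaluate the function; the subset form is what makes its convexity evident.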


The pointwise maximum property extends to the pointwise supremum over an
infinite set of convex functions. If for each $y \in \mathcal{A}$, $f(x, y)$ is convex in $x$, then the
function $g$, defined as

$$g(x) = \sup_{y \in \mathcal{A}} f(x, y) \qquad (3.7)$$

is convex in $x$. Here the domain of $g$ is

$$\operatorname{dom} g = \{x \mid (x, y) \in \operatorname{dom} f \text{ for all } y \in \mathcal{A}, \ \sup_{y \in \mathcal{A}} f(x, y) < \infty\}.$$

Similarly, the pointwise infimum of a set of concave functions is a concave function.

In terms of epigraphs, the pointwise supremum of functions corresponds to the
intersection of epigraphs: with $f$, $g$, and $\mathcal{A}$ as defined in (3.7), we have

$$\operatorname{epi} g = \bigcap_{y \in \mathcal{A}} \operatorname{epi} f(\cdot, y).$$

Thus, the result follows from the fact that the intersection of a family of convex
sets is convex.


Example 3.7 Support function of a set. Let $C \subseteq \mathbf{R}^n$, with $C \ne \emptyset$. The support
function $S_C$ associated with the set $C$ is defined as

$$S_C(x) = \sup\{x^T y \mid y \in C\}$$

(and, naturally, $\operatorname{dom} S_C = \{x \mid \sup_{y \in C} x^T y < \infty\}$).

For each $y \in C$, $x^T y$ is a linear function of $x$, so $S_C$ is the pointwise supremum of a
family of linear functions, hence convex.


Example 3.8 Distance to farthest point of a set. Let $C \subseteq \mathbf{R}^n$. The distance (in any
norm) to the farthest point of $C$,

$$f(x) = \sup_{y \in C} \|x - y\|,$$

is convex. To see this, note that for any $y$, the function $\|x - y\|$ is convex in $x$. Since
$f$ is the pointwise supremum of a family of convex functions (indexed by $y \in C$), it
is a convex function of $x$.


Example 3.9 Least-squares cost as a function of weights. Let $a_1, \ldots, a_n \in \mathbf{R}^m$. In a
weighted least-squares problem we minimize the objective function $\sum_{i=1}^n w_i (a_i^T x - b_i)^2$
over $x \in \mathbf{R}^m$. We refer to $w_i$ as weights, and allow negative $w_i$ (which opens the
possibility that the objective function is unbounded below).

We define the (optimal) weighted least-squares cost as

$$g(w) = \inf_x \sum_{i=1}^n w_i (a_i^T x - b_i)^2,$$

with domain

$$\operatorname{dom} g = \left\{ w \;\middle|\; \inf_x \sum_{i=1}^n w_i (a_i^T x - b_i)^2 > -\infty \right\}.$$

Since $g$ is the infimum of a family of linear functions of $w$ (indexed by $x \in \mathbf{R}^m$), it is
a concave function of $w$.

We can derive an explicit expression for $g$, at least on part of its domain. Let
$W = \operatorname{diag}(w)$, the diagonal matrix with elements $w_1, \ldots, w_n$, and let $A \in \mathbf{R}^{n \times m}$
have rows $a_i^T$, so we have

$$g(w) = \inf_x (Ax - b)^T W (Ax - b) = \inf_x \left( x^T A^T W A x - 2 b^T W A x + b^T W b \right).$$

From this we see that if $A^T W A \not\succeq 0$, the quadratic function is unbounded below
in $x$, so $g(w) = -\infty$, i.e., $w \notin \operatorname{dom} g$. We can give a simple expression for $g$
when $A^T W A \succ 0$ (which defines a strict linear matrix inequality), by analytically
minimizing the quadratic function:

$$\begin{aligned}
g(w) &= b^T W b - b^T W A (A^T W A)^{-1} A^T W b \\
&= \sum_{i=1}^n w_i b_i^2 - \left( \sum_{i=1}^n w_i b_i a_i \right)^T \left( \sum_{j=1}^n w_j a_j a_j^T \right)^{-1} \left( \sum_{i=1}^n w_i b_i a_i \right).
\end{aligned}$$

Concavity of $g$ from this expression is not immediately obvious (but does follow, for
example, from convexity of the matrix fractional function; see example 3.4).
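
In the scalar case $m = 1$ the matrix $A^T W A$ is the number $\sum_i w_i a_i^2$, and the closed-form cost can be compared with the objective evaluated at its minimizer. The sketch below is an illustration only: the data `a`, `b`, the weight vectors, and the function names are assumptions made for the example.

```python
def objective(w, a, b, x):
    """Weighted least-squares objective sum_i w_i (a_i x - b_i)^2 for scalar x (m = 1)."""
    return sum(wi * (ai * x - bi) ** 2 for wi, ai, bi in zip(w, a, b))

def g(w, a, b):
    """Optimal cost b'Wb - (b'Wa)^2 / (a'Wa), valid when a'Wa > 0."""
    awa = sum(wi * ai * ai for wi, ai in zip(w, a))
    if awa <= 0:
        raise ValueError("closed form requires a'Wa > 0")
    bwa = sum(wi * ai * bi for wi, ai, bi in zip(w, a, b))
    bwb = sum(wi * bi * bi for wi, bi in zip(w, b))
    return bwb - bwa ** 2 / awa

a = [1.0, 2.0, -1.0]
b = [0.5, 1.0, 2.0]
w1 = [1.0, 0.5, 2.0]
w2 = [3.0, 1.0, -0.2]   # one negative weight, but a'Wa = 3 + 4 - 0.2 > 0
w_mid = [(u + v) / 2 for u, v in zip(w1, w2)]

# Minimizer of the objective for the weights w1: x = (a'Wb) / (a'Wa).
x_star = (sum(wi * ai * bi for wi, ai, bi in zip(w1, a, b))
          / sum(wi * ai * ai for wi, ai in zip(w1, a)))
```

Evaluating `g` at the midpoint of two weight vectors and comparing with the average of the two costs gives a spot check of concavity in $w$.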


Example 3.10 Maximum eigenvalue of a symmetric matrix. The function $f(X) =
\lambda_{\max}(X)$, with $\operatorname{dom} f = \mathbf{S}^m$, is convex. To see this, we express $f$ as

$$f(X) = \sup\{y^T X y \mid \|y\|_2 = 1\},$$

i.e., as the pointwise supremum of a family of linear functions of $X$ (i.e., $y^T X y$)
indexed by $y \in \mathbf{R}^m$.


Example 3.11 Norm of a matrix. Consider $f(X) = \|X\|_2$ with $\operatorname{dom} f = \mathbf{R}^{p \times q}$,
where $\|\cdot\|_2$ denotes the spectral norm or maximum singular value. Convexity of $f$
follows from

$$f(X) = \sup\{u^T X v \mid \|u\|_2 = 1, \ \|v\|_2 = 1\},$$

which shows it is the pointwise supremum of a family of linear functions of $X$.

As a generalization suppose $\|\cdot\|_a$ and $\|\cdot\|_b$ are norms on $\mathbf{R}^p$ and $\mathbf{R}^q$, respectively.
The induced norm of a matrix $X \in \mathbf{R}^{p \times q}$ is defined as

$$\|X\|_{a,b} = \sup_{v \ne 0} \frac{\|Xv\|_a}{\|v\|_b}.$$

(This reduces to the spectral norm when both norms are Euclidean.) The induced
norm can be expressed as

$$\begin{aligned}
\|X\|_{a,b} &= \sup\{\|Xv\|_a \mid \|v\|_b = 1\} \\
&= \sup\{u^T X v \mid \|u\|_{a*} = 1, \ \|v\|_b = 1\},
\end{aligned}$$

where $\|\cdot\|_{a*}$ is the dual norm of $\|\cdot\|_a$, and we use the fact that

$$\|z\|_a = \sup\{u^T z \mid \|u\|_{a*} = 1\}.$$

Since we have expressed $\|X\|_{a,b}$ as a supremum of linear functions of $X$, it is a convex
function.


3.2  Operations  that  preserve  convexity 
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Representation  as  pointwise  supremum  of  affine  functions 

The examples above illustrate a good method for establishing convexity of a function: by expressing it as the pointwise supremum of a family of affine functions. Except for a technical condition, a converse holds: almost every convex function can be expressed as the pointwise supremum of a family of affine functions. For example, if $f : \mathbf{R}^n \to \mathbf{R}$ is convex, with $\operatorname{dom} f = \mathbf{R}^n$, then we have

$$f(x) = \sup\{ g(x) \mid g \text{ affine},\ g(z) \le f(z) \text{ for all } z \}.$$

In other words, $f$ is the pointwise supremum of the set of all affine global underestimators of it. We give the proof of this result below, and leave the case where $\operatorname{dom} f \neq \mathbf{R}^n$ as an exercise (exercise 3.28).

Suppose $f$ is convex with $\operatorname{dom} f = \mathbf{R}^n$. The inequality

$$f(x) \ge \sup\{ g(x) \mid g \text{ affine},\ g(z) \le f(z) \text{ for all } z \}$$

is clear, since if $g$ is any affine underestimator of $f$, we have $g(x) \le f(x)$. To establish equality, we will show that for each $x \in \mathbf{R}^n$, there is an affine function $g$, which is a global underestimator of $f$, and satisfies $g(x) = f(x)$.

The epigraph of $f$ is, of course, a convex set. Hence we can find a supporting hyperplane to it at $(x, f(x))$, i.e., $a \in \mathbf{R}^n$ and $b \in \mathbf{R}$ with $(a, b) \neq 0$ and

$$\begin{bmatrix} a \\ b \end{bmatrix}^T \begin{bmatrix} x - z \\ f(x) - t \end{bmatrix} \le 0$$

for all $(z, t) \in \operatorname{epi} f$. This means that

$$a^T(x - z) + b\,(f(x) - f(z) - s) \le 0 \qquad (3.8)$$

for all $z \in \operatorname{dom} f = \mathbf{R}^n$ and all $s \ge 0$ (since $(z, t) \in \operatorname{epi} f$ means $t = f(z) + s$ for some $s \ge 0$). For the inequality (3.8) to hold for all $s \ge 0$, we must have $b \ge 0$. If $b = 0$, then the inequality (3.8) reduces to $a^T(x - z) \le 0$ for all $z \in \mathbf{R}^n$, which implies $a = 0$ and contradicts $(a, b) \neq 0$. We conclude that $b > 0$, i.e., that the supporting hyperplane is not vertical.

Using the fact that $b > 0$ we rewrite (3.8) for $s = 0$ as

$$g(z) = f(x) + (a/b)^T (x - z) \le f(z)$$

for all $z$. The function $g$ is an affine underestimator of $f$, and satisfies $g(x) = f(x)$.
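For a differentiable convex function the supporting hyperplane at $(x, f(x))$ is the tangent, so the affine underestimator is $g(z) = f(x) + f'(x)(z - x)$. The sketch below illustrates this on $\mathbf{R}$, assuming NumPy; the test function $f(x) = e^x + x^2$ and the grid are arbitrary choices made for the illustration. It checks that every tangent lies below $f$ and that the pointwise maximum of the tangents recovers $f$ on the grid.

```python
import numpy as np

def f(x):
    # An example convex function on R (an arbitrary choice).
    return np.exp(x) + x**2

def fprime(x):
    return np.exp(x) + 2 * x

def tangent(x0):
    # Affine underestimator of f that is tight at x0.
    return lambda z: f(x0) + fprime(x0) * (z - x0)

zs = np.linspace(-3.0, 3.0, 601)

# Row i holds the tangent anchored at zs[i], evaluated on the whole grid.
G = np.array([tangent(x0)(zs) for x0 in zs])

# Pointwise supremum over the family of tangents.
envelope = G.max(axis=0)
```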


3.2.4  Composition 


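In this section we examine conditions on $h : \mathbf{R}^k \to \mathbf{R}$ and $g : \mathbf{R}^n \to \mathbf{R}^k$ that guarantee convexity or concavity of their composition $f = h \circ g : \mathbf{R}^n \to \mathbf{R}$, defined by

$$f(x) = h(g(x)), \qquad \operatorname{dom} f = \{ x \in \operatorname{dom} g \mid g(x) \in \operatorname{dom} h \}.$$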
Scalar composition

We first consider the case $k = 1$, so $h : \mathbf{R} \to \mathbf{R}$ and $g : \mathbf{R}^n \to \mathbf{R}$. We can restrict ourselves to the case $n = 1$ (since convexity is determined by the behavior of a function on arbitrary lines that intersect its domain).

To discover the composition rules, we start by assuming that $h$ and $g$ are twice differentiable, with $\operatorname{dom} g = \operatorname{dom} h = \mathbf{R}$. In this case, convexity of $f$ reduces to $f'' \ge 0$ (meaning, $f''(x) \ge 0$ for all $x \in \mathbf{R}$).

The second derivative of the composition function $f = h \circ g$ is given by

$$f''(x) = h''(g(x))\, g'(x)^2 + h'(g(x))\, g''(x). \qquad (3.9)$$

Now suppose, for example, that $g$ is convex (so $g'' \ge 0$) and $h$ is convex and nondecreasing (so $h'' \ge 0$ and $h' \ge 0$). It follows from (3.9) that $f'' \ge 0$, i.e., $f$ is convex. In a similar way, the expression (3.9) gives the results:

$f$ is convex if $h$ is convex and nondecreasing, and $g$ is convex,

$f$ is convex if $h$ is convex and nonincreasing, and $g$ is concave,

$f$ is concave if $h$ is concave and nondecreasing, and $g$ is concave,

$f$ is concave if $h$ is concave and nonincreasing, and $g$ is convex.

(3.10)

These statements are valid when the functions $g$ and $h$ are twice differentiable and have domains that are all of $\mathbf{R}$. It turns out that very similar composition rules hold in the general case $n > 1$, without assuming differentiability of $h$ and $g$, or that $\operatorname{dom} g = \mathbf{R}^n$ and $\operatorname{dom} h = \mathbf{R}$:

$f$ is convex if $h$ is convex, $\tilde h$ is nondecreasing, and $g$ is convex,

$f$ is convex if $h$ is convex, $\tilde h$ is nonincreasing, and $g$ is concave,

$f$ is concave if $h$ is concave, $\tilde h$ is nondecreasing, and $g$ is concave,

$f$ is concave if $h$ is concave, $\tilde h$ is nonincreasing, and $g$ is convex.

(3.11)

Here $\tilde h$ denotes the extended-value extension of the function $h$, which assigns the value $\infty$ ($-\infty$) to points not in $\operatorname{dom} h$ for $h$ convex (concave). The only difference between these results, and the results in (3.10), is that we require that the extended-value extension function $\tilde h$ be nonincreasing or nondecreasing, on all of $\mathbf{R}$.

To understand what this means, suppose $h$ is convex, so $\tilde h$ takes on the value $\infty$ outside $\operatorname{dom} h$. To say that $\tilde h$ is nondecreasing means that for any $x, y \in \mathbf{R}$, with $x < y$, we have $\tilde h(x) \le \tilde h(y)$. In particular, this means that if $y \in \operatorname{dom} h$, then $x \in \operatorname{dom} h$. In other words, the domain of $h$ extends infinitely in the negative direction; it is either $\mathbf{R}$, or an interval of the form $(-\infty, a)$ or $(-\infty, a]$. In a similar way, to say that $h$ is convex and $\tilde h$ is nonincreasing means that $h$ is nonincreasing and $\operatorname{dom} h$ extends infinitely in the positive direction. This is illustrated in figure 3.7.


Figure 3.7 Left. The function $x^2$, with domain $\mathbf{R}_+$, is convex and nondecreasing on its domain, but its extended-value extension is not nondecreasing. Right. The function $\max\{x, 0\}^2$, with domain $\mathbf{R}$, is convex, and its extended-value extension is nondecreasing.


Example 3.12 Some simple examples will illustrate the conditions on $h$ that appear in the composition theorems.

• The function $h(x) = \log x$, with $\operatorname{dom} h = \mathbf{R}_{++}$, is concave and satisfies $\tilde h$ nondecreasing.

• The function $h(x) = x^{1/2}$, with $\operatorname{dom} h = \mathbf{R}_+$, is concave and satisfies the condition $\tilde h$ nondecreasing.

• The function $h(x) = x^{3/2}$, with $\operatorname{dom} h = \mathbf{R}_+$, is convex but does not satisfy the condition $\tilde h$ nondecreasing. For example, we have $\tilde h(-1) = \infty$, but $\tilde h(1) = 1$.

• The function $h(x) = x^{3/2}$ for $x \ge 0$, and $h(x) = 0$ for $x < 0$, with $\operatorname{dom} h = \mathbf{R}$, is convex and does satisfy the condition $\tilde h$ nondecreasing.
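The difference between monotonicity of $h$ on its domain and monotonicity of its extended-value extension can be made concrete with a small sketch. The helpers `extend_convex` and `is_nondecreasing` below are hypothetical names introduced for this illustration, and the check on a finite grid is only a sanity check, not a proof.

```python
import math

def extend_convex(h, in_domain):
    # Extended-value extension of a convex h: +infinity outside dom h.
    return lambda x: h(x) if in_domain(x) else math.inf

def is_nondecreasing(fn, grid):
    # Checks fn(x) <= fn(y) for consecutive grid points x < y.
    vals = [fn(x) for x in grid]
    return all(a <= b for a, b in zip(vals, vals[1:]))

grid = [i / 10 for i in range(-30, 31)]

# h(x) = x^{3/2} with dom h = R_+ : extension jumps from +inf down to 0.
h1 = extend_convex(lambda x: x**1.5, lambda x: x >= 0)

# h(x) = max{x, 0}^{3/2} with dom h = R : extension equals h itself.
h2 = extend_convex(lambda x: max(x, 0.0) ** 1.5, lambda x: True)

first_ok = is_nondecreasing(h1, grid)
second_ok = is_nondecreasing(h2, grid)
```

Only the second function passes the check, which matches the third and fourth items of the example.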


The composition results (3.11) can be proved directly, without assuming differentiability, or using the formula (3.9). As an example, we will prove the following composition theorem: if $g$ is convex, $h$ is convex, and $\tilde h$ is nondecreasing, then $f = h \circ g$ is convex. Assume that $x, y \in \operatorname{dom} f$, and $0 \le \theta \le 1$. Since $x, y \in \operatorname{dom} f$, we have that $x, y \in \operatorname{dom} g$ and $g(x), g(y) \in \operatorname{dom} h$. Since $\operatorname{dom} g$ is convex, we conclude that $\theta x + (1 - \theta) y \in \operatorname{dom} g$, and from convexity of $g$, we have

$$g(\theta x + (1 - \theta) y) \le \theta g(x) + (1 - \theta) g(y). \qquad (3.12)$$

Since $g(x), g(y) \in \operatorname{dom} h$, we conclude that $\theta g(x) + (1 - \theta) g(y) \in \operatorname{dom} h$, i.e., the righthand side of (3.12) is in $\operatorname{dom} h$. Now we use the assumption that $\tilde h$ is nondecreasing, which means that its domain extends infinitely in the negative direction. Since the righthand side of (3.12) is in $\operatorname{dom} h$, we conclude that the lefthand side, i.e., $g(\theta x + (1 - \theta) y)$, is in $\operatorname{dom} h$. This means that $\theta x + (1 - \theta) y \in \operatorname{dom} f$. At this point, we have shown that $\operatorname{dom} f$ is convex.

Now using the fact that $\tilde h$ is nondecreasing and the inequality (3.12), we get

$$h(g(\theta x + (1 - \theta) y)) \le h(\theta g(x) + (1 - \theta) g(y)). \qquad (3.13)$$

From convexity of $h$, we have

$$h(\theta g(x) + (1 - \theta) g(y)) \le \theta h(g(x)) + (1 - \theta) h(g(y)). \qquad (3.14)$$

Putting (3.13) and (3.14) together, we have

$$h(g(\theta x + (1 - \theta) y)) \le \theta h(g(x)) + (1 - \theta) h(g(y)),$$

which proves the composition theorem.

Example 3.13 Simple composition results.

• If $g$ is convex then $\exp g(x)$ is convex.

• If $g$ is concave and positive, then $\log g(x)$ is concave.

• If $g$ is concave and positive, then $1/g(x)$ is convex.

• If $g$ is convex and nonnegative and $p \ge 1$, then $g(x)^p$ is convex.

• If $g$ is convex then $-\log(-g(x))$ is convex on $\{x \mid g(x) < 0\}$.


Remark 3.3 The requirement that monotonicity hold for the extended-value extension $\tilde h$, and not just the function $h$, cannot be removed. For example, consider the function $g(x) = x^2$, with $\operatorname{dom} g = \mathbf{R}$, and $h(x) = 0$, with $\operatorname{dom} h = [1, 2]$. Here $g$ is convex, and $h$ is convex and nondecreasing. But the function $f = h \circ g$, given by

$$f(x) = 0, \qquad \operatorname{dom} f = [-\sqrt{2}, -1] \cup [1, \sqrt{2}],$$

is not convex, since its domain is not convex. Here, of course, the function $\tilde h$ is not nondecreasing.


Vector  composition 

We now turn to the more complicated case when $k \ge 1$. Suppose

$$f(x) = h(g(x)) = h(g_1(x), \ldots, g_k(x)),$$

with $h : \mathbf{R}^k \to \mathbf{R}$, $g_i : \mathbf{R}^n \to \mathbf{R}$. Again without loss of generality we can assume $n = 1$. As in the case $k = 1$, we start by assuming the functions are twice differentiable, with $\operatorname{dom} g = \mathbf{R}$ and $\operatorname{dom} h = \mathbf{R}^k$, in order to discover the composition rules. We have

$$f''(x) = g'(x)^T \nabla^2 h(g(x))\, g'(x) + \nabla h(g(x))^T g''(x), \qquad (3.15)$$

which is the vector analog of (3.9). Again the issue is to determine conditions under which $f''(x) \ge 0$ for all $x$ (or $f''(x) \le 0$ for all $x$ for concavity). From (3.15) we can derive many rules, for example:

$f$ is convex if $h$ is convex, $h$ is nondecreasing in each argument, and $g_i$ are convex,

$f$ is convex if $h$ is convex, $h$ is nonincreasing in each argument, and $g_i$ are concave,

$f$ is concave if $h$ is concave, $h$ is nondecreasing in each argument, and $g_i$ are concave.


As in the scalar case, similar composition results hold in general, with $n > 1$, no assumption of differentiability of $h$ or $g$, and general domains. For the general results, the monotonicity condition on $h$ must hold for the extended-value extension $\tilde h$.

To understand the meaning of the condition that the extended-value extension $\tilde h$ be monotonic, we consider the case where $h : \mathbf{R}^k \to \mathbf{R}$ is convex, and $\tilde h$ nondecreasing, i.e., whenever $u \preceq v$, we have $\tilde h(u) \le \tilde h(v)$. This implies that if $v \in \operatorname{dom} h$, then so is $u$: the domain of $h$ must extend infinitely in the $-\mathbf{R}^k_+$ directions. We can express this compactly as $\operatorname{dom} h - \mathbf{R}^k_+ = \operatorname{dom} h$.


Example 3.14 Vector composition examples.

• Let $h(z) = z_{[1]} + \cdots + z_{[r]}$, the sum of the $r$ largest components of $z \in \mathbf{R}^k$. Then $h$ is convex and nondecreasing in each argument. Suppose $g_1, \ldots, g_k$ are convex functions on $\mathbf{R}^n$. Then the composition function $f = h \circ g$, i.e., the pointwise sum of the $r$ largest $g_i$'s, is convex.

• The function $h(z) = \log\left(\sum_{i=1}^k e^{z_i}\right)$ is convex and nondecreasing in each argument, so $\log\left(\sum_{i=1}^k e^{g_i}\right)$ is convex whenever $g_i$ are.

• For $0 < p \le 1$, the function $h(z) = \left(\sum_{i=1}^k z_i^p\right)^{1/p}$ on $\mathbf{R}^k_+$ is concave, and its extension (which has the value $-\infty$ for $z \not\succeq 0$) is nondecreasing in each component. So if $g_i$ are concave and nonnegative, we conclude that $f(x) = \left(\sum_{i=1}^k g_i(x)^p\right)^{1/p}$ is concave.

• Suppose $p \ge 1$, and $g_1, \ldots, g_k$ are convex and nonnegative. Then the function $\left(\sum_{i=1}^k g_i(x)^p\right)^{1/p}$ is convex.

To show this, we consider the function $h : \mathbf{R}^k \to \mathbf{R}$ defined as

$$h(z) = \left( \sum_{i=1}^k \max\{z_i, 0\}^p \right)^{1/p},$$

with $\operatorname{dom} h = \mathbf{R}^k$, so $h = \tilde h$. This function is convex, and nondecreasing, so we conclude $h(g(x))$ is a convex function of $x$. For $z \succeq 0$, we have $h(z) = \left(\sum_{i=1}^k z_i^p\right)^{1/p}$, so our conclusion is that $\left(\sum_{i=1}^k g_i(x)^p\right)^{1/p}$ is convex.

• The geometric mean $h(z) = \left(\prod_{i=1}^k z_i\right)^{1/k}$ on $\mathbf{R}^k_+$ is concave and its extension is nondecreasing in each argument. It follows that if $g_1, \ldots, g_k$ are nonnegative concave functions, then so is their geometric mean, $\left(\prod_{i=1}^k g_i\right)^{1/k}$.
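The first item of the example can be probed numerically. The sketch below, assuming NumPy, builds $k = 5$ convex functions $g_i(x) = (a_i^T x - b_i)^2$ (an arbitrary choice for the illustration) and evaluates Jensen's inequality for the sum of the $r = 2$ largest at random points. The names `sum_r_largest` and `max_jensen_gap` are hypothetical helpers; a nonpositive worst gap is consistent with convexity but does not prove it.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

def g(x):
    # Vector of k = 5 convex functions g_i(x) = (a_i^T x - b_i)^2.
    return (A @ x - b) ** 2

def sum_r_largest(z, r=2):
    # h(z) = z_[1] + ... + z_[r], the sum of the r largest entries.
    return np.sort(z)[-r:].sum()

def f(x):
    return sum_r_largest(g(x))

def max_jensen_gap(fn, trials=1000):
    # Largest observed value of fn(tx + (1-t)y) - (t fn(x) + (1-t) fn(y)).
    worst = -np.inf
    for _ in range(trials):
        x, y = rng.standard_normal(3), rng.standard_normal(3)
        t = rng.uniform()
        gap = fn(t * x + (1 - t) * y) - (t * fn(x) + (1 - t) * fn(y))
        worst = max(worst, gap)
    return worst

worst_gap = max_jensen_gap(f)
```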


3.2.5  Minimization 


We have seen that the maximum or supremum of an arbitrary family of convex functions is convex. It turns out that some special forms of minimization also yield convex functions. If $f$ is convex in $(x, y)$, and $C$ is a convex nonempty set, then the function

$$g(x) = \inf_{y \in C} f(x, y) \qquad (3.16)$$

is convex in $x$, provided $g(x) > -\infty$ for all $x$. The domain of $g$ is the projection of $\operatorname{dom} f$ on its $x$-coordinates, i.e.,

$$\operatorname{dom} g = \{ x \mid (x, y) \in \operatorname{dom} f \text{ for some } y \in C \}.$$

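We prove this by verifying Jensen's inequality for $x_1, x_2 \in \operatorname{dom} g$. Let $\epsilon > 0$. Then there are $y_1, y_2 \in C$ such that $f(x_i, y_i) \le g(x_i) + \epsilon$ for $i = 1, 2$. Now let $\theta \in [0, 1]$. We have

$$\begin{aligned}
g(\theta x_1 + (1 - \theta) x_2) &= \inf_{y \in C} f(\theta x_1 + (1 - \theta) x_2,\ y) \\
&\le f(\theta x_1 + (1 - \theta) x_2,\ \theta y_1 + (1 - \theta) y_2) \\
&\le \theta f(x_1, y_1) + (1 - \theta) f(x_2, y_2) \\
&\le \theta g(x_1) + (1 - \theta) g(x_2) + \epsilon.
\end{aligned}$$

Since this holds for any $\epsilon > 0$, we have

$$g(\theta x_1 + (1 - \theta) x_2) \le \theta g(x_1) + (1 - \theta) g(x_2).$$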

The result can also be seen in terms of epigraphs. With $f$, $g$, and $C$ defined as in (3.16), and assuming the infimum over $y \in C$ is attained for each $x$, we have

$$\operatorname{epi} g = \{ (x, t) \mid (x, y, t) \in \operatorname{epi} f \text{ for some } y \in C \}.$$

Thus $\operatorname{epi} g$ is convex, since it is the projection of a convex set on some of its components.


Example 3.15 Schur complement. Suppose the quadratic function

$$f(x, y) = x^T A x + 2 x^T B y + y^T C y,$$

(where $A$ and $C$ are symmetric) is convex in $(x, y)$, which means

$$\begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \succeq 0.$$

We can express $g(x) = \inf_y f(x, y)$ as

$$g(x) = x^T (A - B C^\dagger B^T) x,$$

where $C^\dagger$ is the pseudo-inverse of $C$ (see §A.5.4). By the minimization rule, $g$ is convex, so we conclude that $A - B C^\dagger B^T \succeq 0$.

If $C$ is invertible, i.e., $C \succ 0$, then the matrix $A - B C^{-1} B^T$ is called the Schur complement of $C$ in the matrix

$$\begin{bmatrix} A & B \\ B^T & C \end{bmatrix}$$

(see §A.5.5).
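A quick numerical check of the Schur complement identity, as a sketch assuming NumPy: the block matrix is generated as $GG^T$ for a random $G$ (an arbitrary way to obtain a positive semidefinite matrix), and the minimizer over $y$ is taken to be $y^\star = -C^\dagger B^T x$, which follows from setting the gradient of $f$ with respect to $y$ to zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 2

# Random positive semidefinite block matrix [[A, B], [B^T, C]].
G = rng.standard_normal((n + m, n + m))
M = G @ G.T
A, B, C = M[:n, :n], M[:n, n:], M[n:, n:]

C_pinv = np.linalg.pinv(C)
S = A - B @ C_pinv @ B.T  # A - B C^dagger B^T

def f(x, y):
    return x @ A @ x + 2 * x @ B @ y + y @ C @ y

x = rng.standard_normal(n)
y_star = -C_pinv @ B.T @ x      # minimizer of f(x, .) over y
g_x = f(x, y_star)              # value of the partial minimization

# Any other y should do no better than y_star.
y_other = y_star + rng.standard_normal(m)
min_eig = np.linalg.eigvalsh((S + S.T) / 2).min()
```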


Example 3.16 Distance to a set. The distance of a point $x$ to a set $S \subseteq \mathbf{R}^n$, in the norm $\|\cdot\|$, is defined as

$$\operatorname{dist}(x, S) = \inf_{y \in S} \|x - y\|.$$

The function $\|x - y\|$ is convex in $(x, y)$, so if the set $S$ is convex, the distance function $\operatorname{dist}(x, S)$ is a convex function of $x$.


Example 3.17 Suppose $h$ is convex. Then the function $g$ defined as

$$g(x) = \inf\{ h(y) \mid Ay = x \}$$

is convex. To see this, we define $f$ by

$$f(x, y) = \begin{cases} h(y) & \text{if } Ay = x \\ \infty & \text{otherwise,} \end{cases}$$

which is convex in $(x, y)$. Then $g$ is the minimum of $f$ over $y$, and hence is convex. (It is not hard to show directly that $g$ is convex.)


3.2.6  Perspective  of  a  function 


If $f : \mathbf{R}^n \to \mathbf{R}$, then the perspective of $f$ is the function $g : \mathbf{R}^{n+1} \to \mathbf{R}$ defined by

$$g(x, t) = t f(x/t),$$

with domain

$$\operatorname{dom} g = \{ (x, t) \mid x/t \in \operatorname{dom} f,\ t > 0 \}.$$

The perspective operation preserves convexity: If $f$ is a convex function, then so is its perspective function $g$. Similarly, if $f$ is concave, then so is $g$.

This can be proved several ways, for example, direct verification of the defining inequality (see exercise 3.33). We give a short proof here using epigraphs and the perspective mapping on $\mathbf{R}^{n+1}$ described in §2.3.3 (which will also explain the name 'perspective'). For $t > 0$ we have

$$\begin{aligned}
(x, t, s) \in \operatorname{epi} g &\iff t f(x/t) \le s \\
&\iff f(x/t) \le s/t \\
&\iff (x/t, s/t) \in \operatorname{epi} f.
\end{aligned}$$

Therefore $\operatorname{epi} g$ is the inverse image of $\operatorname{epi} f$ under the perspective mapping that takes $(u, v, w)$ to $(u, w)/v$. It follows (see §2.3.3) that $\operatorname{epi} g$ is convex, so the function $g$ is convex.


Example 3.18 Euclidean norm squared. The perspective of the convex function $f(x) = x^T x$ on $\mathbf{R}^n$ is

$$g(x, t) = t (x/t)^T (x/t) = \frac{x^T x}{t},$$

which is convex in $(x, t)$ for $t > 0$.

We can deduce convexity of $g$ using several other methods. First, we can express $g$ as the sum of the quadratic-over-linear functions $x_i^2 / t$, which were shown to be convex in §3.1.5. We can also express $g$ as a special case of the matrix fractional function $x^T (tI)^{-1} x$ (see example 3.4).
Example 3.19 Negative logarithm. Consider the convex function $f(x) = -\log x$ on $\mathbf{R}_{++}$. Its perspective is

$$g(x, t) = -t \log(x/t) = t \log(t/x) = t \log t - t \log x,$$

and is convex on $\mathbf{R}^2_{++}$. The function $g$ is called the relative entropy of $t$ and $x$. For $x = 1$, $g$ reduces to the negative entropy function.

From convexity of $g$ we can establish convexity or concavity of several interesting related functions. First, the relative entropy of two vectors $u, v \in \mathbf{R}^n_{++}$, defined as

$$\sum_{i=1}^n u_i \log(u_i / v_i),$$

is convex in $(u, v)$, since it is a sum of relative entropies of $u_i$, $v_i$.

A closely related function is the Kullback-Leibler divergence between $u, v \in \mathbf{R}^n_{++}$, given by

$$D_{\mathrm{kl}}(u, v) = \sum_{i=1}^n \left( u_i \log(u_i / v_i) - u_i + v_i \right), \qquad (3.17)$$

which is convex, since it is the relative entropy plus a linear function of $(u, v)$. The Kullback-Leibler divergence satisfies $D_{\mathrm{kl}}(u, v) \ge 0$, and $D_{\mathrm{kl}}(u, v) = 0$ if and only if $u = v$, and so can be used as a measure of deviation between two positive vectors; see exercise 3.13. (Note that the relative entropy and the Kullback-Leibler divergence are the same when $u$ and $v$ are probability vectors, i.e., satisfy $\mathbf{1}^T u = \mathbf{1}^T v = 1$.)

If we take $v_i = \mathbf{1}^T u$ in the relative entropy function and negate the result, we obtain the concave (and homogeneous) function of $u \in \mathbf{R}^n_{++}$ given by

$$\sum_{i=1}^n u_i \log(\mathbf{1}^T u / u_i) = (\mathbf{1}^T u) \sum_{i=1}^n z_i \log(1 / z_i),$$

where $z = u / (\mathbf{1}^T u)$, which is called the normalized entropy function. The vector $z = u / \mathbf{1}^T u$ is a normalized vector or probability distribution, since its components sum to one; the normalized entropy of $u$ is $\mathbf{1}^T u$ times the entropy of this normalized distribution.
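The properties of the Kullback-Leibler divergence stated in this example can be illustrated with a short sketch, assuming NumPy. The function names `relative_entropy` and `kl_div` and the test vectors are choices made for this illustration; the code checks nonnegativity, agreement with the relative entropy on probability vectors, and Jensen's inequality at one pair of points.

```python
import numpy as np

def relative_entropy(u, v):
    # sum_i u_i log(u_i / v_i) for positive vectors u, v.
    u, v = np.asarray(u, float), np.asarray(v, float)
    return np.sum(u * np.log(u / v))

def kl_div(u, v):
    # Kullback-Leibler divergence (3.17): relative entropy plus a linear term.
    u, v = np.asarray(u, float), np.asarray(v, float)
    return np.sum(u * np.log(u / v) - u + v)

# Two probability vectors: here the two quantities coincide.
u = np.array([0.2, 0.3, 0.5])
v = np.array([0.5, 0.25, 0.25])

# Two unnormalized positive vectors, used for a joint convexity check.
u2 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 1.0, 4.0])
theta = 0.4

mixed = kl_div(theta * u + (1 - theta) * u2, theta * v + (1 - theta) * v2)
bound = theta * kl_div(u, v) + (1 - theta) * kl_div(u2, v2)
```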


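Example 3.20 Suppose $f : \mathbf{R}^m \to \mathbf{R}$ is convex, and $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$, $c \in \mathbf{R}^n$, and $d \in \mathbf{R}$. We define

$$g(x) = (c^T x + d)\, f\!\left( (Ax + b)/(c^T x + d) \right),$$

with

$$\operatorname{dom} g = \{ x \mid c^T x + d > 0,\ (Ax + b)/(c^T x + d) \in \operatorname{dom} f \}.$$

Then $g$ is convex.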


3.3  The  conjugate  function 

In  this  section  we  introduce  an  operation  that  will  play  an  important  role  in  later 
chapters. 


Figure 3.8 A function $f : \mathbf{R} \to \mathbf{R}$, and a value $y \in \mathbf{R}$. The conjugate function $f^*(y)$ is the maximum gap between the linear function $yx$ and $f(x)$, as shown by the dashed line in the figure. If $f$ is differentiable, this occurs at a point $x$ where $f'(x) = y$.


3.3.1  Definition  and  examples 

Let $f : \mathbf{R}^n \to \mathbf{R}$. The function $f^* : \mathbf{R}^n \to \mathbf{R}$, defined as

$$f^*(y) = \sup_{x \in \operatorname{dom} f} \left( y^T x - f(x) \right), \qquad (3.18)$$

is called the conjugate of the function $f$. The domain of the conjugate function consists of $y \in \mathbf{R}^n$ for which the supremum is finite, i.e., for which the difference $y^T x - f(x)$ is bounded above on $\operatorname{dom} f$. This definition is illustrated in figure 3.8.

We see immediately that $f^*$ is a convex function, since it is the pointwise supremum of a family of convex (indeed, affine) functions of $y$. This is true whether or not $f$ is convex. (Note that when $f$ is convex, the subscript $x \in \operatorname{dom} f$ is not necessary since, by convention, $y^T x - f(x) = -\infty$ for $x \notin \operatorname{dom} f$.)

We start with some simple examples, and then describe some rules for conjugating functions. This allows us to derive an analytical expression for the conjugate of many common convex functions.


Example 3.21 We derive the conjugates of some convex functions on $\mathbf{R}$.

• Affine function. $f(x) = ax + b$. As a function of $x$, $yx - ax - b$ is bounded if and only if $y = a$, in which case it is constant. Therefore the domain of the conjugate function $f^*$ is the singleton $\{a\}$, and $f^*(a) = -b$.

• Negative logarithm. $f(x) = -\log x$, with $\operatorname{dom} f = \mathbf{R}_{++}$. The function $xy + \log x$ is unbounded above if $y \ge 0$ and reaches its maximum at $x = -1/y$ otherwise. Therefore, $\operatorname{dom} f^* = \{ y \mid y < 0 \} = -\mathbf{R}_{++}$ and $f^*(y) = -\log(-y) - 1$ for $y < 0$.

• Exponential. $f(x) = e^x$. $xy - e^x$ is unbounded if $y < 0$. For $y > 0$, $xy - e^x$ reaches its maximum at $x = \log y$, so we have $f^*(y) = y \log y - y$. For $y = 0$, $f^*(y) = \sup_x -e^x = 0$. In summary, $\operatorname{dom} f^* = \mathbf{R}_+$ and $f^*(y) = y \log y - y$ (with the interpretation $0 \log 0 = 0$).

• Negative entropy. $f(x) = x \log x$, with $\operatorname{dom} f = \mathbf{R}_+$ (and $f(0) = 0$). The function $xy - x \log x$ is bounded above on $\mathbf{R}_+$ for all $y$, hence $\operatorname{dom} f^* = \mathbf{R}$. It attains its maximum at $x = e^{y-1}$, and substituting we find $f^*(y) = e^{y-1}$.

• Inverse. $f(x) = 1/x$ on $\mathbf{R}_{++}$. For $y > 0$, $yx - 1/x$ is unbounded above. For $y = 0$ this function has supremum 0; for $y < 0$ the supremum is attained at $x = (-y)^{-1/2}$. Therefore we have $f^*(y) = -2(-y)^{1/2}$, with $\operatorname{dom} f^* = -\mathbf{R}_+$.
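The closed-form conjugates in this example can be compared against a brute-force evaluation of the definition (3.18). The sketch below, assuming NumPy, approximates the supremum by a maximum over a fine grid; the helper `conjugate_on_grid`, the grid ranges, and the particular values of $y$ are choices made for this illustration. Since the grid maximum is a lower bound on the supremum, agreement is only up to a small discretization error.

```python
import numpy as np

def conjugate_on_grid(f, xs, y):
    # Approximates f*(y) = sup_x (y x - f(x)) by a maximum over grid points xs.
    return np.max(y * xs - f(xs))

xs_pos = np.linspace(1e-3, 20.0, 200_001)    # grid inside R_++
xs_all = np.linspace(-10.0, 10.0, 200_001)   # grid on a bounded part of R

# Negative logarithm at y = -2: closed form is -log(-y) - 1.
neg_log = conjugate_on_grid(lambda x: -np.log(x), xs_pos, -2.0)

# Exponential at y = 2: closed form is y log y - y.
expo = conjugate_on_grid(np.exp, xs_all, 2.0)

# Negative entropy at y = 0.5: closed form is exp(y - 1).
neg_ent = conjugate_on_grid(lambda x: x * np.log(x), xs_pos, 0.5)

# Inverse at y = -4: closed form is -2 (-y)^{1/2}.
inverse = conjugate_on_grid(lambda x: 1.0 / x, xs_pos, -4.0)
```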


Example 3.22 Strictly convex quadratic function. Consider $f(x) = \tfrac{1}{2} x^T Q x$, with $Q \in \mathbf{S}^n_{++}$. The function $y^T x - \tfrac{1}{2} x^T Q x$ is bounded above as a function of $x$ for all $y$. It attains its maximum at $x = Q^{-1} y$, so

$$f^*(y) = \tfrac{1}{2} y^T Q^{-1} y.$$


Example 3.23 Log-determinant. We consider $f(X) = \log\det X^{-1}$ on $\mathbf{S}^n_{++}$. The conjugate function is defined as

$$f^*(Y) = \sup_{X \succ 0} \left( \operatorname{tr}(YX) + \log\det X \right),$$

since $\operatorname{tr}(YX)$ is the standard inner product on $\mathbf{S}^n$. We first show that $\operatorname{tr}(YX) + \log\det X$ is unbounded above unless $Y \prec 0$. If $Y \not\prec 0$, then $Y$ has an eigenvector $v$, with $\|v\|_2 = 1$, and eigenvalue $\lambda \ge 0$. Taking $X = I + t v v^T$ we find that

$$\operatorname{tr}(YX) + \log\det X = \operatorname{tr} Y + t\lambda + \log\det(I + t v v^T) = \operatorname{tr} Y + t\lambda + \log(1 + t),$$

which is unbounded above as $t \to \infty$.

Now consider the case $Y \prec 0$. We can find the maximizing $X$ by setting the gradient with respect to $X$ equal to zero:

$$\nabla_X \left( \operatorname{tr}(YX) + \log\det X \right) = Y + X^{-1} = 0$$

(see §A.4.1), which yields $X = -Y^{-1}$ (which is, indeed, positive definite). Therefore we have

$$f^*(Y) = \log\det(-Y)^{-1} - n,$$

with $\operatorname{dom} f^* = -\mathbf{S}^n_{++}$.


Example 3.24 Indicator function. Let $I_S$ be the indicator function of a (not necessarily convex) set $S \subseteq \mathbf{R}^n$, i.e., $I_S(x) = 0$ on $\operatorname{dom} I_S = S$. Its conjugate is

$$I_S^*(y) = \sup_{x \in S} y^T x,$$

which is the support function of the set $S$.


Example 3.25 Log-sum-exp function. To derive the conjugate of the log-sum-exp function $f(x) = \log\left(\sum_{i=1}^n e^{x_i}\right)$, we first determine the values of $y$ for which the maximum over $x$ of $y^T x - f(x)$ is attained. By setting the gradient with respect to $x$ equal to zero, we obtain the condition

$$y_i = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}, \qquad i = 1, \ldots, n.$$

These equations are solvable for $x$ if and only if $y \succ 0$ and $\mathbf{1}^T y = 1$. By substituting the expression for $y_i$ into $y^T x - f(x)$ we obtain $f^*(y) = \sum_{i=1}^n y_i \log y_i$. This expression for $f^*$ is still correct if some components of $y$ are zero, as long as $y \succeq 0$ and $\mathbf{1}^T y = 1$, and we interpret $0 \log 0$ as $0$.

In fact the domain of $f^*$ is exactly given by $\mathbf{1}^T y = 1$, $y \succeq 0$. To show this, suppose that a component of $y$ is negative, say, $y_k < 0$. Then we can show that $y^T x - f(x)$ is unbounded above by choosing $x_k = -t$, and $x_i = 0$, $i \neq k$, and letting $t$ go to infinity.

If $y \succeq 0$ but $\mathbf{1}^T y \neq 1$, we choose $x = t\mathbf{1}$, so that

$$y^T x - f(x) = t \mathbf{1}^T y - t - \log n.$$

If $\mathbf{1}^T y > 1$, this grows unboundedly as $t \to \infty$; if $\mathbf{1}^T y < 1$, it grows unboundedly as $t \to -\infty$.

In summary,

$$f^*(y) = \begin{cases} \sum_{i=1}^n y_i \log y_i & \text{if } y \succeq 0 \text{ and } \mathbf{1}^T y = 1 \\ \infty & \text{otherwise.} \end{cases}$$

In other words, the conjugate of the log-sum-exp function is the negative entropy function, restricted to the probability simplex.
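The derivation can be checked at a point of the probability simplex. In the sketch below, assuming NumPy, the vector `y` is an arbitrary strictly positive probability vector; the choice $x^\star = \log y$ satisfies the gradient condition because $\sum_j e^{x^\star_j} = \mathbf{1}^T y = 1$. The helper names `lse` and `neg_entropy` are introduced only for this illustration.

```python
import numpy as np

def lse(x):
    # Numerically stable log-sum-exp.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def neg_entropy(y):
    # sum_i y_i log y_i for strictly positive y.
    return np.sum(y * np.log(y))

y = np.array([0.1, 0.2, 0.7])   # y > 0 with 1^T y = 1

# x* = log y solves y_i = exp(x_i) / sum_j exp(x_j).
x_star = np.log(y)
value_at_star = y @ x_star - lse(x_star)

# No other x should give a larger value of y^T x - f(x).
rng = np.random.default_rng(3)
samples = [y @ x - lse(x) for x in 3.0 * rng.standard_normal((2000, 3))]
best_sample = max(samples)
```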


Example 3.26 Norm. Let $\|\cdot\|$ be a norm on $\mathbf{R}^n$, with dual norm $\|\cdot\|_*$. We will show that the conjugate of $f(x) = \|x\|$ is

$$f^*(y) = \begin{cases} 0 & \|y\|_* \le 1 \\ \infty & \text{otherwise,} \end{cases}$$

i.e., the conjugate of a norm is the indicator function of the dual norm unit ball.

If $\|y\|_* > 1$, then by definition of the dual norm, there is a $z \in \mathbf{R}^n$ with $\|z\| \le 1$ and $y^T z > 1$. Taking $x = tz$ and letting $t \to \infty$, we have

$$y^T x - \|x\| = t (y^T z - \|z\|) \to \infty,$$

which shows that $f^*(y) = \infty$. Conversely, if $\|y\|_* \le 1$, then we have $y^T x \le \|x\| \|y\|_*$ for all $x$, which implies for all $x$, $y^T x - \|x\| \le 0$. Therefore $x = 0$ is the value that maximizes $y^T x - \|x\|$, with maximum value 0.


Example 3.27 Norm squared. Now consider the function $f(x) = (1/2)\|x\|^2$, where $\|\cdot\|$ is a norm, with dual norm $\|\cdot\|_*$. We will show that its conjugate is $f^*(y) = (1/2)\|y\|_*^2$. From $y^T x \le \|y\|_* \|x\|$, we conclude

$$y^T x - (1/2)\|x\|^2 \le \|y\|_* \|x\| - (1/2)\|x\|^2$$

for all $x$. The righthand side is a quadratic function of $\|x\|$, which has maximum value $(1/2)\|y\|_*^2$. Therefore for all $x$, we have

$$y^T x - (1/2)\|x\|^2 \le (1/2)\|y\|_*^2,$$

which shows that $f^*(y) \le (1/2)\|y\|_*^2$.

To show the other inequality, let $x$ be any vector with $y^T x = \|y\|_* \|x\|$, scaled so that $\|x\| = \|y\|_*$. Then we have, for this $x$,

$$y^T x - (1/2)\|x\|^2 = (1/2)\|y\|_*^2,$$

which shows that $f^*(y) \ge (1/2)\|y\|_*^2$.


Example 3.28 Revenue and profit functions. We consider a business or enterprise that consumes $n$ resources and produces a product that can be sold. We let $r = (r_1, \ldots, r_n)$ denote the vector of resource quantities consumed, and $S(r)$ denote the sales revenue derived from the product produced (as a function of the resources consumed). Now let $p_i$ denote the price (per unit) of resource $i$, so the total amount paid for resources by the enterprise is $p^T r$. The profit derived by the firm is then $S(r) - p^T r$. Let us fix the prices of the resources, and ask what is the maximum profit that can be made, by wisely choosing the quantities of resources consumed. This maximum profit is given by

$$M(p) = \sup_r \left( S(r) - p^T r \right).$$

The function $M(p)$ gives the maximum profit attainable, as a function of the resource prices. In terms of conjugate functions, we can express $M$ as

$$M(p) = (-S)^*(-p).$$

Thus the maximum profit (as a function of resource prices) is closely related to the conjugate of gross sales (as a function of resources consumed).


3.3.2  Basic  properties 

Fenchel’s  inequality 

From the definition of conjugate function, we immediately obtain the inequality

$$f(x) + f^*(y) \ge x^T y$$

for all $x$, $y$. This is called Fenchel's inequality (or Young's inequality when $f$ is differentiable).

For example with $f(x) = (1/2) x^T Q x$, where $Q \in \mathbf{S}^n_{++}$, we obtain the inequality

$$x^T y \le (1/2) x^T Q x + (1/2) y^T Q^{-1} y.$$
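Fenchel's inequality for the quadratic example can be checked at random points. The following sketch, assuming NumPy, generates a positive definite $Q$ as $GG^T + I$ (an arbitrary construction for the illustration) and evaluates the gap $f(x) + f^*(y) - x^T y$, which should be nonnegative and should vanish when $y = Qx$, i.e., when $y$ is the gradient of $f$ at $x$. The name `fenchel_gap` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(4)
G = rng.standard_normal((4, 4))
Q = G @ G.T + np.eye(4)          # positive definite by construction
Q_inv = np.linalg.inv(Q)

def fenchel_gap(x, y):
    # f(x) + f*(y) - x^T y for f(x) = (1/2) x^T Q x.
    return 0.5 * x @ Q @ x + 0.5 * y @ Q_inv @ y - x @ y

gaps = [fenchel_gap(rng.standard_normal(4), rng.standard_normal(4))
        for _ in range(1000)]

# The inequality is tight when y = Q x.
x = rng.standard_normal(4)
tight_gap = fenchel_gap(x, Q @ x)
```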

Conjugate  of  the  conjugate 

The examples above, and the name 'conjugate', suggest that the conjugate of the conjugate of a convex function is the original function. This is the case provided a technical condition holds: if $f$ is convex, and $f$ is closed (i.e., $\operatorname{epi} f$ is a closed set; see §A.3.3), then $f^{**} = f$. For example, if $\operatorname{dom} f = \mathbf{R}^n$, then we have $f^{**} = f$, i.e., the conjugate of the conjugate of $f$ is $f$ again (see exercise 3.39).
Differentiable functions

The conjugate of a differentiable function $f$ is also called the Legendre transform of $f$. (To distinguish the general definition from the differentiable case, the term Fenchel conjugate is sometimes used instead of conjugate.)

Suppose $f$ is convex and differentiable, with $\operatorname{dom} f = \mathbf{R}^n$. Any maximizer $x^*$ of $y^T x - f(x)$ satisfies $y = \nabla f(x^*)$, and conversely, if $x^*$ satisfies $y = \nabla f(x^*)$, then $x^*$ maximizes $y^T x - f(x)$. Therefore, if $y = \nabla f(x^*)$, we have

$$f^*(y) = x^{*T} \nabla f(x^*) - f(x^*).$$

This allows us to determine $f^*(y)$ for any $y$ for which we can solve the gradient equation $y = \nabla f(z)$ for $z$.

We can express this another way. Let $z \in \mathbf{R}^n$ be arbitrary and define $y = \nabla f(z)$. Then we have

$$f^*(y) = z^T \nabla f(z) - f(z).$$

Scaling  and  composition  with  affine  transformation 

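For $a > 0$ and $b \in \mathbf{R}$, the conjugate of $g(x) = a f(x) + b$ is $g^*(y) = a f^*(y/a) - b$.

Suppose $A \in \mathbf{R}^{n \times n}$ is nonsingular and $b \in \mathbf{R}^n$. Then the conjugate of $g(x) = f(Ax + b)$ is

$$g^*(y) = f^*(A^{-T} y) - b^T A^{-T} y,$$

with $\operatorname{dom} g^* = A^T \operatorname{dom} f^*$.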

Sums  of  independent  functions 

If $f(u, v) = f_1(u) + f_2(v)$, where $f_1$ and $f_2$ are convex functions with conjugates $f_1^*$ and $f_2^*$, respectively, then

$$f^*(w, z) = f_1^*(w) + f_2^*(z).$$

In other words, the conjugate of the sum of independent convex functions is the sum of the conjugates. ('Independent' means they are functions of different variables.)


3.4  Quasiconvex  functions 

3.4.1  Definition  and  examples 

A function $f : \mathbf{R}^n \to \mathbf{R}$ is called quasiconvex (or unimodal) if its domain and all its sublevel sets

$$S_\alpha = \{ x \in \operatorname{dom} f \mid f(x) \le \alpha \},$$

for $\alpha \in \mathbf{R}$, are convex. A function is quasiconcave if $-f$ is quasiconvex, i.e., every superlevel set $\{ x \mid f(x) \ge \alpha \}$ is convex. A function that is both quasiconvex and quasiconcave is called quasilinear. If a function $f$ is quasilinear, then its domain, and every level set $\{ x \mid f(x) = \alpha \}$ is convex.
Figure 3.9 A quasiconvex function on $\mathbf{R}$. For each $\alpha$, the $\alpha$-sublevel set $S_\alpha$ is convex, i.e., an interval. The sublevel set $S_\alpha$ is the interval $[a, b]$. The sublevel set $S_\beta$ is the interval $(-\infty, c]$.


For  a  function  on  R,  quasiconvexity  requires  that  each  sublevel  set  be  an  interval 
(including,  possibly,  an  infinite  interval).  An  example  of  a  quasiconvex  function  on 
R  is  shown  in  figure  3.9. 

Convex  functions  have  convex  sublevel  sets,  and  so  are  quasiconvex.  But  simple 
examples,  such  as  the  one  shown  in  figure  3.9,  show  that  the  converse  is  not  true. 


Example 3.29 Some examples on $\mathbf{R}$:

• Logarithm. $\log x$ on $\mathbf{R}_{++}$ is quasiconvex (and quasiconcave, hence quasilinear).

• Ceiling function. $\operatorname{ceil}(x) = \inf\{z \in \mathbf{Z} \mid z \ge x\}$ is quasiconvex (and quasiconcave).


These examples show that quasiconvex functions can be concave, or discontinuous.
We now give some examples on $\mathbf{R}^n$.
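The two scalar examples above can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper `sublevel_is_interval`, the sample grid, and the levels in `alphas` are hypothetical choices, and a grid check can suggest, but not prove, that sublevel sets are intervals. The sine function is included purely as a contrasting non-quasiconvex case.

```python
import math

def sublevel_is_interval(f, grid, alpha):
    """True if {t in grid : f(t) <= alpha} is a contiguous run of grid points."""
    inside = [f(t) <= alpha for t in grid]
    if True not in inside:
        return True  # the empty set is convex
    first = inside.index(True)
    last = len(inside) - 1 - inside[::-1].index(True)
    return all(inside[first:last + 1])

# Sample points in (0, 10], so that log is defined everywhere on the grid.
grid = [0.01 * k for k in range(1, 1001)]
alphas = [-2.0, -0.5, 0.0, 0.7, 1.5, 3.0]

log_ok = all(sublevel_is_interval(math.log, grid, a) for a in alphas)
ceil_ok = all(sublevel_is_interval(math.ceil, grid, a) for a in alphas)

# For contrast: sin has disconnected sublevel sets on (0, 10].
sin_ok = all(sublevel_is_interval(math.sin, grid, a) for a in alphas)
```

Applying the same helper to $-f$ would check the superlevel sets, i.e., quasiconcavity.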


Example 3.30 Length of a vector. We define the length of $x \in \mathbf{R}^n$ as the largest
index of a nonzero component, i.e.,

$$f(x) = \max\{i \mid x_i \ne 0\}.$$

(We define the length of the zero vector to be zero.) This function is quasiconvex on
$\mathbf{R}^n$, since its sublevel sets are subspaces:

$$f(x) \le \alpha \iff x_i = 0 \ \text{for} \ i = \lfloor \alpha \rfloor + 1, \ldots, n.$$


Example 3.31 Consider $f : \mathbf{R}^2 \to \mathbf{R}$, with $\operatorname{dom} f = \mathbf{R}^2_+$ and $f(x_1, x_2) = x_1 x_2$. This
function is neither convex nor concave since its Hessian

$$\nabla^2 f(x) = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

is indefinite; it has one positive and one negative eigenvalue. The function $f$ is
quasiconcave, however, since the superlevel sets

$$\{x \in \mathbf{R}^2_+ \mid x_1 x_2 \ge \alpha\}$$

are convex sets for all $\alpha$. (Note, however, that $f$ is not quasiconcave on $\mathbf{R}^2$.)


Example 3.32 Linear-fractional function. The function

$$f(x) = \frac{a^T x + b}{c^T x + d},$$

with $\operatorname{dom} f = \{x \mid c^T x + d > 0\}$, is quasiconvex, and quasiconcave, i.e., quasilinear.
Its $\alpha$-sublevel set is

$$S_\alpha = \{x \mid c^T x + d > 0, \ (a^T x + b)/(c^T x + d) \le \alpha\}
= \{x \mid c^T x + d > 0, \ a^T x + b \le \alpha (c^T x + d)\},$$

which is convex, since it is the intersection of an open halfspace and a closed halfspace.
(The same method can be used to show its superlevel sets are convex.)


Example 3.33 Distance ratio function. Suppose $a, b \in \mathbf{R}^n$, and define

$$f(x) = \frac{\|x - a\|_2}{\|x - b\|_2},$$

i.e., the ratio of the Euclidean distance to $a$ to the distance to $b$. Then $f$ is quasiconvex
on the halfspace $\{x \mid \|x - a\|_2 \le \|x - b\|_2\}$. To see this, we consider the $\alpha$-sublevel
set of $f$, with $\alpha \le 1$ since $f(x) \le 1$ on the halfspace $\{x \mid \|x - a\|_2 \le \|x - b\|_2\}$. This
sublevel set is the set of points satisfying

$$\|x - a\|_2 \le \alpha \|x - b\|_2.$$

Squaring both sides, and rearranging terms, we see that this is equivalent to

$$(1 - \alpha^2) x^T x - 2 (a - \alpha^2 b)^T x + a^T a - \alpha^2 b^T b \le 0.$$

This describes a convex set (in fact a Euclidean ball) if $\alpha \le 1$.


Example 3.34 Internal rate of return. Let $x = (x_0, x_1, \ldots, x_n)$ denote a cash flow
sequence over $n$ periods, where $x_i > 0$ means a payment to us in period $i$, and $x_i < 0$
means a payment by us in period $i$. We define the present value of a cash flow, with
interest rate $r \ge 0$, to be

$$\mathrm{PV}(x, r) = \sum_{i=0}^{n} (1 + r)^{-i} x_i.$$

(The factor $(1 + r)^{-i}$ is a discount factor for a payment by or to us in period $i$.)

Now we consider cash flows for which $x_0 < 0$ and $x_0 + x_1 + \cdots + x_n > 0$. This
means that we start with an investment of $|x_0|$ in period 0, and that the total of the
remaining cash flow, $x_1 + \cdots + x_n$, (not taking any discount factors into account)
exceeds our initial investment.

For such a cash flow, $\mathrm{PV}(x, 0) > 0$ and $\mathrm{PV}(x, r) \to x_0 < 0$ as $r \to \infty$, so it follows
that for at least one $r \ge 0$, we have $\mathrm{PV}(x, r) = 0$. We define the internal rate of
return of the cash flow as the smallest interest rate $r \ge 0$ for which the present value
is zero:

$$\mathrm{IRR}(x) = \inf\{r \ge 0 \mid \mathrm{PV}(x, r) = 0\}.$$

Internal rate of return is a quasiconcave function of $x$ (restricted to $x_0 < 0$,
$x_1 + \cdots + x_n > 0$). To see this, we note that

$$\mathrm{IRR}(x) \ge R \iff \mathrm{PV}(x, r) > 0 \ \text{for} \ 0 \le r < R.$$

The lefthand side defines the $R$-superlevel set of IRR. The righthand side is the
intersection of the sets $\{x \mid \mathrm{PV}(x, r) > 0\}$, indexed by $r$, over the range $0 \le r < R$.
For each $r$, $\mathrm{PV}(x, r) > 0$ defines an open halfspace, so the righthand side defines a
convex set.
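As a rough illustration of example 3.34, the sketch below computes the internal rate of return by scanning for the first sign change of the present value and then bisecting. The function names, the scan step, the upper limit `r_max`, and the two cash flows are assumptions made for this illustration only; a coarse scan could in principle step over a pair of closely spaced roots.

```python
def present_value(x, r):
    """PV(x, r) = sum_i (1 + r)^(-i) x_i for a cash flow x = (x_0, ..., x_n)."""
    return sum(xi / (1.0 + r) ** i for i, xi in enumerate(x))

def irr(x, step=1e-3, r_max=10.0, tol=1e-10):
    """Approximate the smallest r >= 0 with PV(x, r) = 0.

    Assumes x_0 < 0 and x_0 + ... + x_n > 0, so that PV(x, 0) > 0.
    """
    lo = 0.0
    # Scan forward while the present value stays positive.
    while present_value(x, lo + step) > 0.0:
        lo += step
        if lo > r_max:
            raise ValueError("no sign change found below r_max")
    hi = lo + step
    # Now PV(x, lo) > 0 >= PV(x, hi): bisect to locate the root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if present_value(x, mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return hi

x = [-1.0, 0.6, 0.6]   # invest 1, receive 0.6 in each of two periods
y = [-1.0, 0.0, 1.5]   # invest 1, receive 1.5 in period 2

# Quasiconcavity: IRR of a convex combination is at least the smaller IRR.
quasiconcave_ok = True
for k in range(11):
    theta = k / 10.0
    mix = [theta * xi + (1.0 - theta) * yi for xi, yi in zip(x, y)]
    quasiconcave_ok = quasiconcave_ok and irr(mix) >= min(irr(x), irr(y)) - 1e-8
```

The check over `theta` samples the superlevel-set argument above on one segment only; it is not a substitute for the proof.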


3.4.2  Basic  properties 

The  examples  above  show  that  quasiconvexity  is  a  considerable  generalization  of 
convexity.  Still,  many  of  the  properties  of  convex  functions  hold,  or  have  analogs, 
for  quasiconvex  functions.  For  example,  there  is  a  variation  on  Jensen’s  inequality 
that characterizes quasiconvexity: A function $f$ is quasiconvex if and only if $\operatorname{dom} f$
is convex and for any $x, y \in \operatorname{dom} f$ and $0 \le \theta \le 1$,

$$f(\theta x + (1 - \theta) y) \le \max\{f(x), f(y)\}, \qquad (3.19)$$

i.e.,  the  value  of  the  function  on  a  segment  does  not  exceed  the  maximum  of 
its  values  at  the  endpoints.  The  inequality  (3.19)  is  sometimes  called  Jensen’s 
inequality  for  quasiconvex  functions,  and  is  illustrated  in  figure  3.10. 


Example 3.35 Cardinality of a nonnegative vector. The cardinality or size of a
vector $x \in \mathbf{R}^n$ is the number of nonzero components, and denoted $\operatorname{card}(x)$. The
function card is quasiconcave on $\mathbf{R}^n_+$ (but not $\mathbf{R}^n$). This follows immediately from
the modified Jensen inequality

$$\operatorname{card}(x + y) \ge \min\{\operatorname{card}(x), \operatorname{card}(y)\},$$

which holds for $x, y \succeq 0$.


Example 3.36 Rank of positive semidefinite matrix. The function $\operatorname{rank} X$ is quasiconcave
on $\mathbf{S}^n_+$. This follows from the modified Jensen inequality (3.19),

$$\operatorname{rank}(X + Y) \ge \min\{\operatorname{rank} X, \operatorname{rank} Y\},$$

which holds for $X, Y \in \mathbf{S}^n_+$. (This can be considered an extension of the previous
example, since $\operatorname{rank}(\operatorname{diag}(x)) = \operatorname{card}(x)$ for $x \succeq 0$.)
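The modified Jensen inequalities in examples 3.35 and 3.36 are easy to test on random data. The following is a minimal sketch assuming NumPy is available; the helper names, the dimension 6, and the random generators are illustrative choices, and `numpy.linalg.matrix_rank` returns a numerical rank based on a singular-value tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def card(x):
    """Number of nonzero components of x."""
    return int(np.count_nonzero(x))

def random_sparse_nonneg(n):
    """Random nonnegative vector with roughly half of its entries set to zero."""
    x = rng.random(n) + 0.1
    x[rng.random(n) < 0.5] = 0.0
    return x

def random_psd(n, r):
    """Random positive semidefinite n-by-n matrix of rank at most r."""
    B = rng.standard_normal((n, r))
    return B @ B.T

card_ok = True
rank_ok = True
for _ in range(200):
    x, y = random_sparse_nonneg(6), random_sparse_nonneg(6)
    card_ok = card_ok and card(x + y) >= min(card(x), card(y))

    X = random_psd(6, int(rng.integers(1, 6)))
    Y = random_psd(6, int(rng.integers(1, 6)))
    r_sum = np.linalg.matrix_rank(X + Y)
    r_min = min(np.linalg.matrix_rank(X), np.linalg.matrix_rank(Y))
    rank_ok = rank_ok and bool(r_sum >= r_min)

# Without the sign restriction the inequality can fail: x and -x cancel.
x_cex = np.array([1.0, 2.0, 3.0])
cancelled = card(x_cex + (-x_cex))
```

The last two lines show why card is not quasiconcave on all of $\mathbf{R}^n$: both vectors have cardinality 3, while their sum has cardinality 0.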

Figure 3.10 A quasiconvex function on $\mathbf{R}$. The value of $f$ between $x$ and $y$
is no more than $\max\{f(x), f(y)\}$.


Like convexity, quasiconvexity is characterized by the behavior of a function $f$
on lines: $f$ is quasiconvex if and only if its restriction to any line intersecting its
domain is quasiconvex. In particular, quasiconvexity of a function can be verified by
restricting it to an arbitrary line, and then checking quasiconvexity of the resulting
function on $\mathbf{R}$.

Quasiconvex  functions  on  R 

We can give a simple characterization of quasiconvex functions on $\mathbf{R}$. We consider
continuous functions, since stating the conditions in the general case is cumbersome.
A continuous function $f : \mathbf{R} \to \mathbf{R}$ is quasiconvex if and only if at least one of the
following conditions holds:

• $f$ is nondecreasing

• $f$ is nonincreasing

• there is a point $c \in \operatorname{dom} f$ such that for $t \le c$ (and $t \in \operatorname{dom} f$), $f$ is
nonincreasing, and for $t \ge c$ (and $t \in \operatorname{dom} f$), $f$ is nondecreasing.

The point $c$ can be chosen as any point which is a global minimizer of $f$. Figure 3.11
illustrates this.


3.4.3  Differentiable  quasiconvex  functions 

First-order  conditions 

Suppose $f : \mathbf{R}^n \to \mathbf{R}$ is differentiable. Then $f$ is quasiconvex if and only if $\operatorname{dom} f$
is convex and for all $x, y \in \operatorname{dom} f$

$$f(y) \le f(x) \implies \nabla f(x)^T (y - x) \le 0. \qquad (3.20)$$

Figure 3.11 A quasiconvex function on $\mathbf{R}$. The function is nonincreasing for
$t \le c$ and nondecreasing for $t \ge c$.


Figure 3.12 Three level curves of a quasiconvex function $f$ are shown. The
vector $\nabla f(x)$ defines a supporting hyperplane to the sublevel set $\{z \mid f(z) \le
f(x)\}$ at $x$.


This  is  the  analog  of  inequality  (3.2),  for  quasiconvex  functions.  We  leave  the  proof 
as  an  exercise  (exercise  3.43). 

The condition (3.20) has a simple geometric interpretation when $\nabla f(x) \ne 0$. It
states that $\nabla f(x)$ defines a supporting hyperplane to the sublevel set $\{y \mid f(y) \le
f(x)\}$, at the point $x$, as illustrated in figure 3.12.

While the first-order condition for convexity (3.2), and the first-order condition
for quasiconvexity (3.20) are similar, there are some important differences. For
example, if $f$ is convex and $\nabla f(x) = 0$, then $x$ is a global minimizer of $f$. But this
statement is false for quasiconvex functions: it is possible that $\nabla f(x) = 0$, but $x$
is not a global minimizer of $f$.
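A simple instance of this difference is $f(x) = x^3$ on $\mathbf{R}$, which is nondecreasing and therefore quasiconvex, yet has zero derivative at $x = 0$ without being minimized there. The sketch below checks the first-order condition (3.20) on a grid for this function; the choice of function, the grid, and the small tolerance are assumptions made for illustration.

```python
def f(x):
    return x ** 3

def fprime(x):
    return 3.0 * x ** 2

grid = [-2.0 + 0.01 * k for k in range(401)]  # sample points in [-2, 2]

# Condition (3.20): f(y) <= f(x) implies f'(x) * (y - x) <= 0.
first_order_ok = all(
    fprime(x) * (y - x) <= 1e-12
    for x in grid
    for y in grid
    if f(y) <= f(x)
)

# The derivative vanishes at 0, but 0 is not a global minimizer.
zero_slope_at_origin = fprime(0.0) == 0.0
origin_not_minimizer = f(-1.0) < f(0.0)
```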

Second-order conditions

Now suppose $f$ is twice differentiable. If $f$ is quasiconvex, then for all $x \in \operatorname{dom} f$,
and all $y \in \mathbf{R}^n$, we have

$$y^T \nabla f(x) = 0 \implies y^T \nabla^2 f(x) y \ge 0. \qquad (3.21)$$

For a quasiconvex function on $\mathbf{R}$, this reduces to the simple condition

$$f'(x) = 0 \implies f''(x) \ge 0,$$

i.e., at any point with zero slope, the second derivative is nonnegative. For a
quasiconvex function on $\mathbf{R}^n$, the interpretation of the condition (3.21) is a bit
more complicated. As in the case $n = 1$, we conclude that whenever $\nabla f(x) = 0$,
we must have $\nabla^2 f(x) \succeq 0$. When $\nabla f(x) \ne 0$, the condition (3.21) means that
$\nabla^2 f(x)$ is positive semidefinite on the $(n-1)$-dimensional subspace $\nabla f(x)^\perp$. This
implies that $\nabla^2 f(x)$ can have at most one negative eigenvalue.

As a (partial) converse, if $f$ satisfies

$$y^T \nabla f(x) = 0 \implies y^T \nabla^2 f(x) y > 0 \qquad (3.22)$$

for all $x \in \operatorname{dom} f$ and all $y \in \mathbf{R}^n$, $y \ne 0$, then $f$ is quasiconvex. This condition is
the same as requiring $\nabla^2 f(x)$ to be positive definite for any point with $\nabla f(x) = 0$,
and for all other points, requiring $\nabla^2 f(x)$ to be positive definite on the $(n-1)$-dimensional
subspace $\nabla f(x)^\perp$.

Proof  of  second-order  conditions  for  quasiconvexity 

By restricting the function to an arbitrary line, it suffices to consider the case in
which $f : \mathbf{R} \to \mathbf{R}$.

We first show that if $f : \mathbf{R} \to \mathbf{R}$ is quasiconvex on an interval $(a, b)$, then it
must satisfy (3.21), i.e., if $f'(c) = 0$ with $c \in (a, b)$, then we must have $f''(c) \ge 0$. If
$f'(c) = 0$ with $c \in (a, b)$, $f''(c) < 0$, then for small positive $\epsilon$ we have $f(c - \epsilon) < f(c)$
and $f(c + \epsilon) < f(c)$. It follows that the sublevel set $\{x \mid f(x) \le f(c) - \epsilon\}$ is
disconnected for small positive $\epsilon$, and therefore not convex, which contradicts our
assumption that $f$ is quasiconvex.

Now we show that if the condition (3.22) holds, then $f$ is quasiconvex. Assume
that (3.22) holds, i.e., for each $c \in (a, b)$ with $f'(c) = 0$, we have $f''(c) > 0$. This
means that whenever the function $f'$ crosses the value 0, it is strictly increasing.
Therefore it can cross the value 0 at most once. If $f'$ does not cross the value
0 at all, then $f$ is either nonincreasing or nondecreasing on $(a, b)$, and therefore
quasiconvex. Otherwise it must cross the value 0 exactly once, say at $c \in (a, b)$.
Since $f''(c) > 0$, it follows that $f'(t) \le 0$ for $a < t \le c$, and $f'(t) \ge 0$ for $c \le t < b$.
This shows that $f$ is quasiconvex.


3.4.4  Operations  that  preserve  quasiconvexity 

Nonnegative  weighted  maximum 

A nonnegative weighted maximum of quasiconvex functions, i.e.,

$$f = \max\{w_1 f_1, \ldots, w_m f_m\},$$

with $w_i \ge 0$ and $f_i$ quasiconvex, is quasiconvex. The property extends to the
general  pointwise  supremum 


$$f(x) = \sup_{y \in C} \, (w(y) g(x, y))$$

where $w(y) \ge 0$ and $g(x, y)$ is quasiconvex in $x$ for each $y$. This fact can be easily
verified: $f(x) \le \alpha$ if and only if

$$w(y) g(x, y) \le \alpha \ \text{for all} \ y \in C,$$

i.e., the $\alpha$-sublevel set of $f$ is the intersection of the $\alpha$-sublevel sets of the functions
$w(y) g(x, y)$ in the variable $x$.


Example 3.37 Generalized eigenvalue. The maximum generalized eigenvalue of a
pair of symmetric matrices $(X, Y)$, with $Y \succ 0$, is defined as

$$\lambda_{\max}(X, Y) = \sup_{u \ne 0} \frac{u^T X u}{u^T Y u} = \sup\{\lambda \mid \det(\lambda Y - X) = 0\}.$$

(See §A.5.3.) This function is quasiconvex on $\operatorname{dom} f = \mathbf{S}^n \times \mathbf{S}^n_{++}$.

To see this we consider the expression

$$\lambda_{\max}(X, Y) = \sup_{u \ne 0} \frac{u^T X u}{u^T Y u}.$$

For each $u \ne 0$, the function $u^T X u / u^T Y u$ is linear-fractional in $(X, Y)$, hence a
quasiconvex function of $(X, Y)$. We conclude that $\lambda_{\max}$ is quasiconvex, since it is the
supremum of a family of quasiconvex functions.
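The quasiconvexity of the maximum generalized eigenvalue can be spot-checked with NumPy. This is a sketch under assumptions: `lambda_max` reduces the pencil $(X, Y)$ to an ordinary symmetric eigenvalue problem through a Cholesky factor of $Y$, which is one standard approach among several, and the random test matrices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def lambda_max(X, Y):
    """Largest generalized eigenvalue of (X, Y), with Y positive definite."""
    L = np.linalg.cholesky(Y)          # Y = L L^T
    Linv = np.linalg.inv(L)
    M = Linv @ X @ Linv.T              # same eigenvalues as the pencil (X, Y)
    return float(np.linalg.eigvalsh(0.5 * (M + M.T))[-1])

def random_pair(n):
    """Random symmetric X and positive definite Y."""
    G = rng.standard_normal((n, n))
    H = rng.standard_normal((n, n))
    return 0.5 * (G + G.T), H @ H.T + 0.1 * np.eye(n)

# Jensen-type inequality (3.19) applied to lambda_max on random segments.
quasiconvex_ok = True
for _ in range(200):
    (X1, Y1), (X2, Y2) = random_pair(4), random_pair(4)
    theta = rng.random()
    lhs = lambda_max(theta * X1 + (1 - theta) * X2,
                     theta * Y1 + (1 - theta) * Y2)
    rhs = max(lambda_max(X1, Y1), lambda_max(X2, Y2))
    quasiconvex_ok = quasiconvex_ok and lhs <= rhs + 1e-8 * (1.0 + abs(rhs))
```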


Composition 

If $g : \mathbf{R}^n \to \mathbf{R}$ is quasiconvex and $h : \mathbf{R} \to \mathbf{R}$ is nondecreasing, then $f = h \circ g$ is
quasiconvex.

The composition of a quasiconvex function with an affine or linear-fractional
transformation yields a quasiconvex function. If $f$ is quasiconvex, then $g(x) =
f(Ax + b)$ is quasiconvex, and $\tilde{g}(x) = f((Ax + b)/(c^T x + d))$ is quasiconvex on the
set

$$\{x \mid c^T x + d > 0, \ (Ax + b)/(c^T x + d) \in \operatorname{dom} f\}.$$


Minimization 

If $f(x, y)$ is quasiconvex jointly in $x$ and $y$ and $C$ is a convex set, then the function

$$g(x) = \inf_{y \in C} f(x, y)$$

is quasiconvex.

To show this, we need to show that $\{x \mid g(x) \le \alpha\}$ is convex, where $\alpha \in \mathbf{R}$ is
arbitrary. From the definition of $g$, $g(x) \le \alpha$ if and only if for any $\epsilon > 0$ there exists
a $y \in C$ with $f(x, y) \le \alpha + \epsilon$. Now let $x_1$ and $x_2$ be two points in the $\alpha$-sublevel
set of $g$. Then for any $\epsilon > 0$, there exist $y_1, y_2 \in C$ with

$$f(x_1, y_1) \le \alpha + \epsilon, \qquad f(x_2, y_2) \le \alpha + \epsilon,$$

and since $f$ is quasiconvex in $x$ and $y$, we also have

$$f(\theta x_1 + (1 - \theta) x_2, \ \theta y_1 + (1 - \theta) y_2) \le \alpha + \epsilon,$$

for $0 \le \theta \le 1$. Hence $g(\theta x_1 + (1 - \theta) x_2) \le \alpha$, which proves that $\{x \mid g(x) \le \alpha\}$ is
convex.


3.4.5  Representation  via  family  of  convex  functions 

In the sequel, it will be convenient to represent the sublevel sets of a quasiconvex
function $f$ (which are convex) via inequalities of convex functions. We seek a family
of convex functions $\phi_t : \mathbf{R}^n \to \mathbf{R}$, indexed by $t \in \mathbf{R}$, with

$$f(x) \le t \iff \phi_t(x) \le 0, \qquad (3.23)$$

i.e., the $t$-sublevel set of the quasiconvex function $f$ is the 0-sublevel set of the
convex function $\phi_t$. Evidently $\phi_t$ must satisfy the property that for all $x \in \mathbf{R}^n$,
$\phi_t(x) \le 0 \implies \phi_s(x) \le 0$ for $s \ge t$. This is satisfied if for each $x$, $\phi_t(x)$ is a
nonincreasing function of $t$, i.e., $\phi_s(x) \le \phi_t(x)$ whenever $s \ge t$.

To see that such a representation always exists, we can take

$$\phi_t(x) = \begin{cases} 0 & f(x) \le t \\ \infty & \text{otherwise,} \end{cases}$$

i.e., $\phi_t$ is the indicator function of the $t$-sublevel set of $f$. Obviously this representation
is not unique; for example if the sublevel sets of $f$ are closed, we can take

$$\phi_t(x) = \operatorname{dist}(x, \{z \mid f(z) \le t\}).$$

We are usually interested in a family $\phi_t$ with nice properties, such as differentiability.


Example 3.38 Convex over concave function. Suppose $p$ is a convex function, $q$ is a
concave function, with $p(x) \ge 0$ and $q(x) > 0$ on a convex set $C$. Then the function
$f$ defined by $f(x) = p(x)/q(x)$, on $C$, is quasiconvex.

Here we have

$$f(x) \le t \iff p(x) - t q(x) \le 0,$$

so we can take $\phi_t(x) = p(x) - t q(x)$ for $t \ge 0$. For each $t$, $\phi_t$ is convex and for each
$x$, $\phi_t(x)$ is decreasing in $t$.


3.5 Log-concave and log-convex functions

3.5.1  Definition 

A function $f : \mathbf{R}^n \to \mathbf{R}$ is logarithmically concave or log-concave if $f(x) > 0$
for all $x \in \operatorname{dom} f$ and $\log f$ is concave. It is said to be logarithmically convex
or log-convex if $\log f$ is convex. Thus $f$ is log-convex if and only if $1/f$ is log-concave.
It is convenient to allow $f$ to take on the value zero, in which case we
take $\log f(x) = -\infty$. In this case we say $f$ is log-concave if the extended-value
function $\log f$ is concave.

We can express log-concavity directly, without logarithms: a function $f : \mathbf{R}^n \to
\mathbf{R}$, with convex domain and $f(x) > 0$ for all $x \in \operatorname{dom} f$, is log-concave if and only
if for all $x, y \in \operatorname{dom} f$ and $0 \le \theta \le 1$, we have

$$f(\theta x + (1 - \theta) y) \ge f(x)^\theta f(y)^{1 - \theta}.$$

In  particular,  the  value  of  a  log-concave  function  at  the  average  of  two  points  is  at 
least  the  geometric  mean  of  the  values  at  the  two  points. 

From the composition rules we know that $e^h$ is convex if $h$ is convex, so a log-convex
function is convex. Similarly, a nonnegative concave function is log-concave.
It  is  also  clear  that  a  log-convex  function  is  quasiconvex  and  a  log-concave  function 
is  quasiconcave,  since  the  logarithm  is  monotone  increasing. 


Example 3.39 Some simple examples of log-concave and log-convex functions.

• Affine function. $f(x) = a^T x + b$ is log-concave on $\{x \mid a^T x + b > 0\}$.

• Powers. $f(x) = x^a$, on $\mathbf{R}_{++}$, is log-convex for $a \le 0$, and log-concave for $a \ge 0$.

• Exponentials. $f(x) = e^{ax}$ is log-convex and log-concave.

• The cumulative distribution function of a Gaussian density,

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2} \, du,$$

is log-concave (see exercise 3.54).

• Gamma function. The Gamma function,

$$\Gamma(x) = \int_0^\infty u^{x-1} e^{-u} \, du,$$

is log-convex for $x \ge 1$ (see exercise 3.52).

• Determinant. $\det X$ is log-concave on $\mathbf{S}^n_{++}$.

• Determinant over trace. $\det X / \operatorname{tr} X$ is log-concave on $\mathbf{S}^n_{++}$ (see exercise 3.49).
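Two of the scalar entries above can be spot-checked with the Python standard library. The sketch below tests midpoint log-concavity of the Gaussian cumulative distribution function and midpoint log-convexity of the Gamma function on grids; the grids, the tolerance, and the helper names are assumptions chosen for illustration, and passing a grid test does not prove the property.

```python
import math

def log_gauss_cdf(x):
    """log Phi(x), using erfc to stay accurate in the left tail."""
    return math.log(0.5 * math.erfc(-x / math.sqrt(2.0)))

def midpoint_concave(g, grid, tol=1e-12):
    """Check g((x + y)/2) >= (g(x) + g(y))/2 for all pairs of grid points."""
    return all(
        g(0.5 * (x + y)) >= 0.5 * (g(x) + g(y)) - tol
        for x in grid
        for y in grid
    )

def midpoint_convex(g, grid, tol=1e-12):
    """Check g((x + y)/2) <= (g(x) + g(y))/2 for all pairs of grid points."""
    return midpoint_concave(lambda t: -g(t), grid, tol)

cdf_grid = [-5.0 + 0.25 * k for k in range(41)]    # points in [-5, 5]
gamma_grid = [1.0 + 0.25 * k for k in range(37)]   # points in [1, 10]

gauss_cdf_log_concave = midpoint_concave(log_gauss_cdf, cdf_grid)
gamma_log_convex = midpoint_convex(math.lgamma, gamma_grid)
```

Here `math.lgamma` returns the logarithm of the absolute value of the Gamma function, which coincides with $\log \Gamma(x)$ for $x > 0$.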


Example 3.40 Log-concave density functions. Many common probability density
functions are log-concave. Two examples are the multivariate normal distribution,

$$f(x) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma}} \, e^{-\frac{1}{2} (x - \bar{x})^T \Sigma^{-1} (x - \bar{x})}$$

(where $\bar{x} \in \mathbf{R}^n$ and $\Sigma \in \mathbf{S}^n_{++}$), and the exponential distribution on $\mathbf{R}^n_+$,

$$f(x) = \left( \prod_{i=1}^n \lambda_i \right) e^{-\lambda^T x}$$

(where $\lambda \succ 0$). Another example is the uniform distribution over a convex set $C$,

$$f(x) = \begin{cases} 1/\alpha & x \in C \\ 0 & x \notin C \end{cases}$$

where $\alpha = \operatorname{vol}(C)$ is the volume (Lebesgue measure) of $C$. In this case $\log f$ takes
on the value $-\infty$ outside $C$, and $-\log \alpha$ on $C$, hence is concave.

As a more exotic example consider the Wishart distribution, defined as follows. Let
$x_1, \ldots, x_p \in \mathbf{R}^n$ be independent Gaussian random vectors with zero mean and
covariance $\Sigma \in \mathbf{S}^n_{++}$, with $p > n$. The random matrix $X = \sum_{i=1}^p x_i x_i^T$ has the Wishart
density

$$f(X) = a \, (\det X)^{(p - n - 1)/2} \, e^{-\frac{1}{2} \operatorname{tr}(\Sigma^{-1} X)},$$

with $\operatorname{dom} f = \mathbf{S}^n_{++}$, and $a$ is a positive constant. The Wishart density is log-concave,
since

$$\log f(X) = \log a + \frac{p - n - 1}{2} \log \det X - \frac{1}{2} \operatorname{tr}(\Sigma^{-1} X),$$

which is a concave function of $X$.


3.5.2  Properties 

Twice  differentiable  log-convex/concave  functions 

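Suppose $f$ is twice differentiable, with $\operatorname{dom} f$ convex, so

$$\nabla^2 \log f(x) = \frac{1}{f(x)} \nabla^2 f(x) - \frac{1}{f(x)^2} \nabla f(x) \nabla f(x)^T.$$

We conclude that $f$ is log-convex if and only if for all $x \in \operatorname{dom} f$,

$$f(x) \nabla^2 f(x) \succeq \nabla f(x) \nabla f(x)^T,$$

and log-concave if and only if for all $x \in \operatorname{dom} f$,

$$f(x) \nabla^2 f(x) \preceq \nabla f(x) \nabla f(x)^T.$$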
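
As a quick worked check of these conditions in one dimension (an illustration added here, not a derivation from the text), take $f(x) = x^a$ on $\mathbf{R}_{++}$, for which the matrix inequalities reduce to comparing $f(x) f''(x)$ with $f'(x)^2$:

```latex
\begin{aligned}
f(x)\, f''(x) &= x^a \cdot a(a-1)\, x^{a-2} = a(a-1)\, x^{2a-2}, \\
f'(x)^2 &= a^2 x^{2a-2}, \\
f(x)\, f''(x) - f'(x)^2 &= -a\, x^{2a-2}.
\end{aligned}
```

Since $x^{2a-2} > 0$ on $\mathbf{R}_{++}$, the difference is nonpositive exactly when $a \ge 0$ (log-concave) and nonnegative exactly when $a \le 0$ (log-convex), in agreement with the powers entry of example 3.39.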

Multiplication,  addition,  and  integration 

Log-convexity and log-concavity are closed under multiplication and positive scaling.
For example, if $f$ and $g$ are log-concave, then so is the pointwise product
$h(x) = f(x) g(x)$, since $\log h(x) = \log f(x) + \log g(x)$, and $\log f(x)$ and $\log g(x)$ are
concave functions of $x$.

Simple examples show that the sum of log-concave functions is not, in general,
log-concave. Log-convexity, however, is preserved under sums. Let $f$ and $g$ be log-convex
functions, i.e., $F = \log f$ and $G = \log g$ are convex. From the composition
rules for convex functions, it follows that

$$\log(\exp F + \exp G) = \log(f + g)$$

is convex. Therefore the sum of two log-convex functions is log-convex.
More generally, if $f(x, y)$ is log-convex in $x$ for each $y \in C$ then

$$g(x) = \int_C f(x, y) \, dy$$

is log-convex.
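
One of the simple examples alluded to above is a sum of two well-separated Gaussian bumps, which has a dip between the peaks and so cannot be log-concave. The sketch below uses hypothetical test functions and grids chosen for illustration; it also checks on a grid that a sum of two log-convex functions remains log-convex.

```python
import math

def log_concave_on_grid(h, grid, tol=1e-9):
    """Midpoint test: log h((x + y)/2) >= (log h(x) + log h(y))/2 on the grid."""
    return all(
        math.log(h(0.5 * (x + y))) >= 0.5 * (math.log(h(x)) + math.log(h(y))) - tol
        for x in grid
        for y in grid
    )

def log_convex_on_grid(h, grid, tol=1e-9):
    """Midpoint test: log h((x + y)/2) <= (log h(x) + log h(y))/2 on the grid."""
    return all(
        math.log(h(0.5 * (x + y))) <= 0.5 * (math.log(h(x)) + math.log(h(y))) + tol
        for x in grid
        for y in grid
    )

# Two log-concave bumps centered at 0 and 4.
f = lambda x: math.exp(-x * x)
g = lambda x: math.exp(-(x - 4.0) ** 2)
bump_grid = [-2.0 + 0.5 * k for k in range(17)]      # points in [-2, 6]

f_log_concave = log_concave_on_grid(f, bump_grid)
g_log_concave = log_concave_on_grid(g, bump_grid)
sum_log_concave = log_concave_on_grid(lambda x: f(x) + g(x), bump_grid)

# Two log-convex functions on the positive reals and their sum.
u = lambda x: 1.0 / x
v = lambda x: math.exp(x * x)
pos_grid = [0.5 + 0.25 * k for k in range(11)]       # points in [0.5, 3]

sum_log_convex = log_convex_on_grid(lambda x: u(x) + v(x), pos_grid)
```

The failure for the sum is visible at the midpoint of the two peaks, where $f(2) + g(2) = 2e^{-4}$ is far below the geometric mean of the values at 0 and 4.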


Example 3.41 Laplace transform of a nonnegative function and the moment and
cumulant generating functions. Suppose $p : \mathbf{R}^n \to \mathbf{R}$ satisfies $p(x) \ge 0$ for all $x$. The
Laplace transform of $p$,

$$P(z) = \int p(x) e^{-z^T x} \, dx,$$

is log-convex on $\mathbf{R}^n$. (Here $\operatorname{dom} P$ is, naturally, $\{z \mid P(z) < \infty\}$.)

Now suppose $p$ is a density, i.e., satisfies $\int p(x) \, dx = 1$. The function $M(z) = P(-z)$
is called the moment generating function of the density. It gets its name from the fact
that the moments of the density can be found from the derivatives of the moment
generating function, evaluated at $z = 0$, e.g.,

$$\nabla M(0) = \mathbf{E}\, v, \qquad \nabla^2 M(0) = \mathbf{E}\, v v^T,$$

where $v$ is a random variable with density $p$.

The function $\log M(z)$, which is convex, is called the cumulant generating function
for $p$, since its derivatives give the cumulants of the density. For example, the first
and second derivatives of the cumulant generating function, evaluated at zero, are
the mean and covariance of the associated random variable:

$$\nabla \log M(0) = \mathbf{E}\, v, \qquad \nabla^2 \log M(0) = \mathbf{E}\, (v - \mathbf{E}\, v)(v - \mathbf{E}\, v)^T.$$


Integration  of  log-concave  functions 

In some special cases log-concavity is preserved by integration. If $f : \mathbf{R}^n \times \mathbf{R}^m \to \mathbf{R}$
is log-concave, then

$$g(x) = \int f(x, y) \, dy$$

is a log-concave function of $x$ (on $\mathbf{R}^n$). (The integration here is over $\mathbf{R}^m$.) A proof
of this result is not simple; see the references.

This result has many important consequences, some of which we describe in
the rest of this section. It implies, for example, that marginal distributions of log-concave
probability densities are log-concave. It also implies that log-concavity is
closed under convolution, i.e., if $f$ and $g$ are log-concave on $\mathbf{R}^n$, then so is the
convolution

$$(f * g)(x) = \int f(x - y) g(y) \, dy.$$

(To see this, note that $g(y)$ and $f(x - y)$ are log-concave in $(x, y)$, hence the product
$f(x - y) g(y)$ is; then the integration result applies.)

Suppose $C \subseteq \mathbf{R}^n$ is a convex set and $w$ is a random vector in $\mathbf{R}^n$ with log-concave
probability density $p$. Then the function

$$f(x) = \operatorname{prob}(x + w \in C)$$

is log-concave in $x$. To see this, express $f$ as

$$f(x) = \int g(x + w) p(w) \, dw,$$

where $g$ is defined as

$$g(u) = \begin{cases} 1 & u \in C \\ 0 & u \notin C, \end{cases}$$

(which is log-concave) and apply the integration result.


Example 3.42 The cumulative distribution function of a probability density function
$f : \mathbf{R}^n \to \mathbf{R}$ is defined as

$$F(x) = \operatorname{prob}(w \preceq x) = \int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} f(z) \, dz_1 \cdots dz_n,$$

where $w$ is a random variable with density $f$. If $f$ is log-concave, then $F$ is log-concave.
We have already encountered a special case: the cumulative distribution
function of a Gaussian random variable,

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt,$$

is log-concave. (See example 3.39 and exercise 3.54.)


Example 3.43 Yield function. Let $x \in \mathbf{R}^n$ denote the nominal or target value of a
set of parameters of a product that is manufactured. Variation in the manufacturing
process causes the parameters of the product, when manufactured, to have the value
$x + w$, where $w \in \mathbf{R}^n$ is a random vector that represents manufacturing variation,
and is usually assumed to have zero mean. The yield of the manufacturing process,
as a function of the nominal parameter values, is given by

$$Y(x) = \operatorname{prob}(x + w \in S),$$

where $S \subseteq \mathbf{R}^n$ denotes the set of acceptable parameter values for the product, i.e.,
the product specifications.

If the density of the manufacturing error $w$ is log-concave (for example, Gaussian) and
the set $S$ of product specifications is convex, then the yield function $Y$ is log-concave.
This implies that the $\alpha$-yield region, defined as the set of nominal parameters for
which the yield exceeds $\alpha$, is convex. For example, the 95% yield region

$$\{x \mid Y(x) \ge 0.95\} = \{x \mid \log Y(x) \ge \log 0.95\}$$

is convex, since it is a superlevel set of the concave function $\log Y$.
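
A one-dimensional instance makes the yield example concrete. In the sketch below the specification set is the interval $S = [-3, 3]$ and the manufacturing error is standard Gaussian; both are assumptions chosen for illustration, as are the grid and the function names. The yield then has the closed form $Y(x) = \Phi(3 - x) - \Phi(-3 - x)$.

```python
import math

def gauss_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def yield_at(x, lo=-3.0, hi=3.0, sigma=1.0):
    """Y(x) = prob(lo <= x + w <= hi) for w ~ N(0, sigma^2)."""
    return gauss_cdf((hi - x) / sigma) - gauss_cdf((lo - x) / sigma)

grid = [-4.0 + 0.125 * k for k in range(65)]   # nominal values in [-4, 4]
log_y = [math.log(yield_at(x)) for x in grid]

# Midpoint concavity of log Y over pairs of grid points whose midpoint
# is again a grid point (indices i and j of equal parity).
yield_log_concave = all(
    log_y[(i + j) // 2] >= 0.5 * (log_y[i] + log_y[j]) - 1e-12
    for i in range(len(grid))
    for j in range(i, len(grid), 2)
)

# The 95% yield region should be an interval of nominal values.
inside = [yield_at(x) >= 0.95 for x in grid]
first = inside.index(True)
last = len(inside) - 1 - inside[::-1].index(True)
region_is_interval = all(inside[first:last + 1])
```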


Example 3.44 Volume of polyhedron. Let $A \in \mathbf{R}^{m \times n}$. Define

$$P_u = \{x \in \mathbf{R}^n \mid Ax \preceq u\}.$$

Then its volume $\operatorname{vol} P_u$ is a log-concave function of $u$.

To prove this, note that the function

$$\Psi(x, u) = \begin{cases} 1 & Ax \preceq u \\ 0 & \text{otherwise,} \end{cases}$$

is log-concave. By the integration result, we conclude that

$$\int \Psi(x, u) \, dx = \operatorname{vol} P_u$$

is log-concave.


3.6  Convexity  with  respect  to  generalized  inequalities 

We  now  consider  generalizations  of  the  notions  of  monotonicity  and  convexity,  using 
generalized  inequalities  instead  of  the  usual  ordering  on  R. 


3.6.1  Monotonicity  with  respect  to  a  generalized  inequality 

Suppose $K \subseteq \mathbf{R}^n$ is a proper cone with associated generalized inequality $\preceq_K$. A
function $f : \mathbf{R}^n \to \mathbf{R}$ is called $K$-nondecreasing if

$$x \preceq_K y \implies f(x) \le f(y),$$

and $K$-increasing if

$$x \preceq_K y, \ x \ne y \implies f(x) < f(y).$$

We define $K$-nonincreasing and $K$-decreasing functions in a similar way.


Example 3.45 Monotone vector functions. A function $f : \mathbf{R}^n \to \mathbf{R}$ is nondecreasing
with respect to $\mathbf{R}^n_+$ if and only if

$$x_1 \le y_1, \ldots, x_n \le y_n \implies f(x) \le f(y)$$

for all $x, y$. This is the same as saying that $f$, when restricted to any component $x_i$
(i.e., $x_i$ is considered the variable while $x_j$ for $j \ne i$ are fixed), is nondecreasing.


Example 3.46 Matrix monotone functions. A function $f : \mathbf{S}^n \to \mathbf{R}$ is called matrix
monotone (increasing, decreasing) if it is monotone with respect to the positive
semidefinite cone. Some examples of matrix monotone functions of the variable
$X \in \mathbf{S}^n$:

• $\operatorname{tr}(WX)$, where $W \in \mathbf{S}^n$, is matrix nondecreasing if $W \succeq 0$, and matrix increasing
if $W \succ 0$ (it is matrix nonincreasing if $W \preceq 0$, and matrix decreasing
if $W \prec 0$).

• $\operatorname{tr}(X^{-1})$ is matrix decreasing on $\mathbf{S}^n_{++}$.

• $\det X$ is matrix increasing on $\mathbf{S}^n_{++}$, and matrix nondecreasing on $\mathbf{S}^n_+$.
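
These three claims can be spot-checked on random matrices. The sketch below assumes NumPy and compares each function at a positive definite $X$ and at $Y = X + bb^T$, so that $X \preceq Y$ and $X \ne Y$; the generators, dimension, and tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_pd(n):
    """Random positive definite matrix with eigenvalues at least 1."""
    G = rng.standard_normal((n, n))
    return G @ G.T + np.eye(n)

trace_inv_decreasing = True
det_increasing = True
trace_w_nondecreasing = True
for _ in range(200):
    X = random_pd(4)
    b = rng.standard_normal(4)
    Y = X + np.outer(b, b)        # Y - X is positive semidefinite and nonzero
    H = rng.standard_normal((4, 4))
    W = H @ H.T                   # a positive semidefinite weight matrix

    trace_inv_decreasing = trace_inv_decreasing and bool(
        np.trace(np.linalg.inv(Y)) < np.trace(np.linalg.inv(X)))
    det_increasing = det_increasing and bool(
        np.linalg.det(Y) > np.linalg.det(X))
    trace_w_nondecreasing = trace_w_nondecreasing and bool(
        np.trace(W @ Y) >= np.trace(W @ X) - 1e-9)
```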


Gradient  conditions  for  monotonicity 

Recall that a differentiable function $f : \mathbf{R} \to \mathbf{R}$, with convex (i.e., interval) domain,
is nondecreasing if and only if $f'(x) \ge 0$ for all $x \in \operatorname{dom} f$, and increasing if
$f'(x) > 0$ for all $x \in \operatorname{dom} f$ (but the converse is not true). These conditions
are readily extended to the case of monotonicity with respect to a generalized
inequality. A differentiable function $f$, with convex domain, is $K$-nondecreasing if
and only if

$$\nabla f(x) \succeq_{K^*} 0 \qquad (3.24)$$

for all $x \in \operatorname{dom} f$. Note the difference with the simple scalar case: the gradient
must be nonnegative in the dual inequality. For the strict case, we have the
following: If

$$\nabla f(x) \succ_{K^*} 0 \qquad (3.25)$$

for all $x \in \operatorname{dom} f$, then $f$ is $K$-increasing. As in the scalar case, the converse is
not true.

Let us prove these first-order conditions for monotonicity. First, assume that
$f$ satisfies (3.24) for all $x$, but is not $K$-nondecreasing, i.e., there exist $x, y$ with
$x \preceq_K y$ and $f(y) < f(x)$. By differentiability of $f$ there exists a $t \in [0, 1]$ with

$$\frac{d}{dt} f(x + t(y - x)) = \nabla f(x + t(y - x))^T (y - x) < 0.$$

Since $y - x \in K$ this means

$$\nabla f(x + t(y - x)) \notin K^*,$$

which contradicts our assumption that (3.24) is satisfied everywhere. In a similar
way it can be shown that (3.25) implies $f$ is $K$-increasing.

It is also straightforward to see that it is necessary that (3.24) hold everywhere.
Assume (3.24) does not hold for $x = z$. By the definition of dual cone this means
there exists a $v \in K$ with

$$\nabla f(z)^T v < 0.$$

Now consider $h(t) = f(z + tv)$ as a function of $t$. We have $h'(0) = \nabla f(z)^T v < 0$,
and therefore there exists $t > 0$ with $h(t) = f(z + tv) < h(0) = f(z)$, which means
$f$ is not $K$-nondecreasing.


3.6.2  Convexity  with  respect  to  a  generalized  inequality 

Suppose $K \subseteq \mathbf{R}^m$ is a proper cone with associated generalized inequality $\preceq_K$. We
say $f : \mathbf{R}^n \to \mathbf{R}^m$ is $K$-convex if for all $x, y$, and $0 \le \theta \le 1$,

$$f(\theta x + (1 - \theta) y) \preceq_K \theta f(x) + (1 - \theta) f(y).$$

The function is strictly $K$-convex if

$$f(\theta x + (1 - \theta) y) \prec_K \theta f(x) + (1 - \theta) f(y)$$

for all $x \ne y$ and $0 < \theta < 1$. These definitions reduce to ordinary convexity and
strict convexity when $m = 1$ (and $K = \mathbf{R}_+$).


Example 3.47 Convexity with respect to componentwise inequality. A function $f :
\mathbf{R}^n \to \mathbf{R}^m$ is convex with respect to componentwise inequality (i.e., the generalized
inequality induced by $\mathbf{R}^m_+$) if and only if for all $x, y$ and $0 \le \theta \le 1$,

$$f(\theta x + (1 - \theta) y) \preceq \theta f(x) + (1 - \theta) f(y),$$

i.e., each component $f_i$ is a convex function. The function $f$ is strictly convex with
respect to componentwise inequality if and only if each component $f_i$ is strictly convex.


Example 3.48 Matrix convexity. Suppose $f$ is a symmetric matrix valued function,
i.e., $f : \mathbf{R}^n \to \mathbf{S}^m$. The function $f$ is convex with respect to matrix inequality if

$$f(\theta x + (1 - \theta) y) \preceq \theta f(x) + (1 - \theta) f(y)$$

for any $x$ and $y$, and for $\theta \in [0, 1]$. This is sometimes called matrix convexity. An
equivalent definition is that the scalar function $z^T f(x) z$ is convex for all vectors $z$.
(This is often a good way to prove matrix convexity.) A matrix function is strictly
matrix convex if

$$f(\theta x + (1 - \theta) y) \prec \theta f(x) + (1 - \theta) f(y)$$

when $x \ne y$ and $0 < \theta < 1$, or, equivalently, if $z^T f z$ is strictly convex for every $z \ne 0$.
Some examples:

• The function $f(X) = X X^T$ where $X \in \mathbf{R}^{n \times m}$ is matrix convex, since for
fixed $z$ the function $z^T X X^T z = \|X^T z\|_2^2$ is a convex quadratic function of (the
components of) $X$. For the same reason, $f(X) = X^2$ is matrix convex on $\mathbf{S}^n$.

• The function $X^p$ is matrix convex on $\mathbf{S}^n_{++}$ for $1 \le p \le 2$ or $-1 \le p \le 0$, and
matrix concave for $0 \le p \le 1$.

• The function $f(X) = e^X$ is not matrix convex on $\mathbf{S}^n$, for $n \ge 2$.
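
The first bullet can be verified numerically by forming the gap $\theta f(X) + (1 - \theta) f(Y) - f(\theta X + (1 - \theta) Y)$ and checking that it is positive semidefinite. The sketch below assumes NumPy; the random matrices, sizes, and the eigenvalue tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_sym(n):
    """Random symmetric n-by-n matrix."""
    G = rng.standard_normal((n, n))
    return 0.5 * (G + G.T)

def min_eig(S):
    """Smallest eigenvalue of the symmetric part of S."""
    return float(np.linalg.eigvalsh(0.5 * (S + S.T))[0])

square_ok = True   # f(X) = X^2 on symmetric matrices
gram_ok = True     # f(X) = X X^T on rectangular matrices
for _ in range(200):
    theta = rng.random()

    X, Y = random_sym(4), random_sym(4)
    Z = theta * X + (1 - theta) * Y
    gap = theta * (X @ X) + (1 - theta) * (Y @ Y) - Z @ Z
    square_ok = square_ok and min_eig(gap) >= -1e-9

    A, B = rng.standard_normal((4, 3)), rng.standard_normal((4, 3))
    C = theta * A + (1 - theta) * B
    gap = theta * (A @ A.T) + (1 - theta) * (B @ B.T) - C @ C.T
    gram_ok = gram_ok and min_eig(gap) >= -1e-9
```

In both cases the gap equals $\theta(1 - \theta)$ times $(X - Y)^2$ or $(A - B)(A - B)^T$ respectively, which is why it is positive semidefinite.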


Many of the results for convex functions have extensions to $K$-convex functions.
As a simple example, a function is $K$-convex if and only if its restriction to any
line in its domain is $K$-convex. In the rest of this section we list a few results for
$K$-convexity that we will use later; more results are explored in the exercises.

Dual characterization of $K$-convexity

A function $f$ is $K$-convex if and only if for every $w \succeq_{K^*} 0$, the (real-valued) function
$w^T f$ is convex (in the ordinary sense); $f$ is strictly $K$-convex if and only if for every
nonzero $w \succeq_{K^*} 0$ the function $w^T f$ is strictly convex. (These follow directly from
the definitions and properties of dual inequality.)
Differentiable $K$-convex functions

A differentiable function $f$ is $K$-convex if and only if its domain is convex, and for all $x, y \in \mathbf{dom}\, f$,

$$f(y) \succeq_K f(x) + Df(x)(y - x).$$

(Here $Df(x) \in \mathbf{R}^{m \times n}$ is the derivative or Jacobian matrix of $f$ at $x$; see §A.4.1.) The function $f$ is strictly $K$-convex if and only if for all $x, y \in \mathbf{dom}\, f$ with $x \neq y$,

$$f(y) \succ_K f(x) + Df(x)(y - x).$$


Composition  theorem 

Many of the results on composition can be generalized to $K$-convexity. For example, if $g : \mathbf{R}^n \to \mathbf{R}^p$ is $K$-convex, $h : \mathbf{R}^p \to \mathbf{R}$ is convex, and $\tilde h$ (the extended-value extension of $h$) is $K$-nondecreasing, then $h \circ g$ is convex. This generalizes the fact that a nondecreasing convex function of a convex function is convex. The condition that $\tilde h$ be $K$-nondecreasing implies that $\mathbf{dom}\, h - K = \mathbf{dom}\, h$.


Example 3.49 The quadratic matrix function $g : \mathbf{R}^{m \times n} \to \mathbf{S}^n$ defined by

$$g(X) = X^T A X + B^T X + X^T B + C,$$

where $A \in \mathbf{S}^m$, $B \in \mathbf{R}^{m \times n}$, and $C \in \mathbf{S}^n$, is convex when $A \succeq 0$.

The function $h : \mathbf{S}^n \to \mathbf{R}$ defined by $h(Y) = -\log\det(-Y)$ is convex and increasing on $\mathbf{dom}\, h = -\mathbf{S}^n_{++}$.

By the composition theorem, we conclude that

$$f(X) = -\log\det(-(X^T A X + B^T X + X^T B + C))$$

is convex on

$$\mathbf{dom}\, f = \{X \in \mathbf{R}^{m \times n} \mid X^T A X + B^T X + X^T B + C \prec 0\}.$$

This generalizes the fact that

$$-\log(-(ax^2 + bx + c))$$

is convex on

$$\{x \in \mathbf{R} \mid ax^2 + bx + c < 0\},$$

provided $a \ge 0$.
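As a numerical illustration of example 3.49, the sketch below tests midpoint convexity of $f(X) = -\log\det(-g(X))$ on random points of its domain. The particular data ($A = MM^T$, a random $B$, and $C = -50I$) are assumptions chosen only so that small random $X$ lie in the domain; the function names are illustrative.

```python
import numpy as np

def quad(X, A, B, C):
    """g(X) = X^T A X + B^T X + X^T B + C."""
    return X.T @ A @ X + B.T @ X + X.T @ B + C

def f(X, A, B, C):
    """-log det(-g(X)), or +inf when g(X) is not negative definite."""
    eigs = np.linalg.eigvalsh(-quad(X, A, B, C))
    if eigs.min() <= 0:
        return np.inf
    return -float(np.sum(np.log(eigs)))

def max_midpoint_violation(m=3, n=2, trials=200, seed=0):
    """Largest value of f((X+Y)/2) - (f(X)+f(Y))/2 over random pairs."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((m, m))
    A = M @ M.T                      # positive semidefinite
    B = rng.standard_normal((m, n))
    C = -50.0 * np.eye(n)            # keeps small X inside dom f
    worst = -np.inf
    for _ in range(trials):
        X = 0.3 * rng.standard_normal((m, n))
        Y = 0.3 * rng.standard_normal((m, n))
        fx, fy = f(X, A, B, C), f(Y, A, B, C)
        if not (np.isfinite(fx) and np.isfinite(fy)):
            continue                 # skip points outside the domain
        worst = max(worst, f((X + Y) / 2, A, B, C) - (fx + fy) / 2)
    return worst

print(max_midpoint_violation())
```

A nonpositive result (up to rounding) is consistent with convexity; it does not replace the composition argument.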
Exercises

Definition  of  convexity 

3.1 Suppose $f : \mathbf{R} \to \mathbf{R}$ is convex, and $a, b \in \mathbf{dom}\, f$ with $a < b$.

(a) Show that

$$f(x) \le \frac{b - x}{b - a} f(a) + \frac{x - a}{b - a} f(b)$$

for all $x \in [a, b]$.

(b) Show that

$$\frac{f(x) - f(a)}{x - a} \le \frac{f(b) - f(a)}{b - a} \le \frac{f(b) - f(x)}{b - x}$$

for all $x \in (a, b)$. Draw a sketch that illustrates this inequality.

(c) Suppose $f$ is differentiable. Use the result in (b) to show that

$$f'(a) \le \frac{f(b) - f(a)}{b - a} \le f'(b).$$

Note that these inequalities also follow from (3.2):

$$f(b) \ge f(a) + f'(a)(b - a), \qquad f(a) \ge f(b) + f'(b)(a - b).$$

(d) Suppose $f$ is twice differentiable. Use the result in (c) to show that $f''(a) \ge 0$ and $f''(b) \ge 0$.
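The chain of slopes in exercise 3.1(b) is easy to evaluate for a concrete convex function. This is a small sketch with an illustrative helper name (`three_slopes`); the choice $f = \exp$ is an assumption made only for the demonstration.

```python
import math

def three_slopes(f, a, x, b):
    """Slopes of the chords (a, x), (a, b) and (x, b), for a < x < b."""
    left = (f(x) - f(a)) / (x - a)
    full = (f(b) - f(a)) / (b - a)
    right = (f(b) - f(x)) / (b - x)
    return left, full, right

left, full, right = three_slopes(math.exp, 0.0, 0.3, 1.0)
print(left, full, right)  # expected to be nondecreasing for convex f
```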

3.2 Level sets of convex, concave, quasiconvex, and quasiconcave functions. Some level sets of a function $f$ are shown below. The curve labeled 1 shows $\{x \mid f(x) = 1\}$, etc.

[Figure: level curves of $f$, labeled by function value.]

Could $f$ be convex (concave, quasiconvex, quasiconcave)? Explain your answer. Repeat for the level curves shown below.

[Figure: a second set of level curves of $f$, labeled by function value.]
3.3 Inverse of an increasing convex function. Suppose $f : \mathbf{R} \to \mathbf{R}$ is increasing and convex on its domain $(a, b)$. Let $g$ denote its inverse, i.e., the function with domain $(f(a), f(b))$ and $g(f(x)) = x$ for $a < x < b$. What can you say about convexity or concavity of $g$?

3.4 [RV73, page 15] Show that a continuous function $f : \mathbf{R}^n \to \mathbf{R}$ is convex if and only if for every line segment, its average value on the segment is less than or equal to the average of its values at the endpoints of the segment: For every $x, y \in \mathbf{R}^n$,

$$\int_0^1 f(x + \lambda(y - x))\, d\lambda \le \frac{f(x) + f(y)}{2}.$$


3.5 [RV73, page 22] Running average of a convex function. Suppose $f : \mathbf{R} \to \mathbf{R}$ is convex, with $\mathbf{R}_+ \subseteq \mathbf{dom}\, f$. Show that its running average $F$, defined as

$$F(x) = \frac{1}{x} \int_0^x f(t)\, dt, \qquad \mathbf{dom}\, F = \mathbf{R}_{++},$$

is convex. Hint. For each $s$, $f(sx)$ is convex in $x$, so $\int_0^1 f(sx)\, ds$ is convex.

3.6  Functions  and  epigraphs.  When  is  the  epigraph  of  a  function  a  halfspace?  When  is  the 
epigraph  of  a  function  a  convex  cone?  When  is  the  epigraph  of  a  function  a  polyhedron? 


3.7 Suppose $f : \mathbf{R}^n \to \mathbf{R}$ is convex with $\mathbf{dom}\, f = \mathbf{R}^n$, and bounded above on $\mathbf{R}^n$. Show that $f$ is constant.

3.8 Second-order condition for convexity. Prove that a twice differentiable function $f$ is convex if and only if its domain is convex and $\nabla^2 f(x) \succeq 0$ for all $x \in \mathbf{dom}\, f$. Hint. First consider the case $f : \mathbf{R} \to \mathbf{R}$. You can use the first-order condition for convexity (which was proved on page 70).


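3.9 Second-order conditions for convexity on an affine set. Let $F \in \mathbf{R}^{n \times m}$, $\hat x \in \mathbf{R}^n$. The restriction of $f : \mathbf{R}^n \to \mathbf{R}$ to the affine set $\{Fz + \hat x \mid z \in \mathbf{R}^m\}$ is defined as the function $\tilde f : \mathbf{R}^m \to \mathbf{R}$ with

$$\tilde f(z) = f(Fz + \hat x), \qquad \mathbf{dom}\, \tilde f = \{z \mid Fz + \hat x \in \mathbf{dom}\, f\}.$$

Suppose $f$ is twice differentiable with a convex domain.

(a) Show that $\tilde f$ is convex if and only if for all $z \in \mathbf{dom}\, \tilde f$

$$F^T \nabla^2 f(Fz + \hat x) F \succeq 0.$$

(b) Suppose $A \in \mathbf{R}^{p \times n}$ is a matrix whose nullspace is equal to the range of $F$, i.e., $AF = 0$ and $\mathbf{rank}\, A = n - \mathbf{rank}\, F$. Show that $\tilde f$ is convex if and only if for all $z \in \mathbf{dom}\, \tilde f$ there exists a $\lambda \in \mathbf{R}$ such that

$$\nabla^2 f(Fz + \hat x) + \lambda A^T A \succeq 0.$$

Hint. Use the following result: If $B \in \mathbf{S}^n$ and $A \in \mathbf{R}^{p \times n}$, then $x^T B x \ge 0$ for all $x \in \mathcal{N}(A)$ if and only if there exists a $\lambda$ such that $B + \lambda A^T A \succeq 0$.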

3.10 An extension of Jensen's inequality. One interpretation of Jensen's inequality is that randomization or dithering hurts, i.e., raises the average value of a convex function: For $f$ convex and $v$ a zero mean random variable, we have $\mathbf{E} f(x_0 + v) \ge f(x_0)$. This leads to the following conjecture. If $f$ is convex, then the larger the variance of $v$, the larger $\mathbf{E} f(x_0 + v)$.

(a) Give a counterexample that shows that this conjecture is false. Find zero mean random variables $v$ and $w$, with $\mathbf{var}(v) > \mathbf{var}(w)$, a convex function $f$, and a point $x_0$, such that $\mathbf{E} f(x_0 + v) < \mathbf{E} f(x_0 + w)$.

(b) The conjecture is true when $v$ and $w$ are scaled versions of each other. Show that $\mathbf{E} f(x_0 + tv)$ is monotone increasing in $t \ge 0$, when $f$ is convex and $v$ is zero mean.
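The claim in part (b) can be observed on a small discrete example. In this sketch the two-point zero mean distribution (values $-2$ and $1$ with probabilities $1/3$ and $2/3$) and the choice $f = \exp$ are assumptions for illustration only; `expected_value` is an illustrative name.

```python
import numpy as np

def expected_value(f, x0, t, values, probs):
    """E f(x0 + t v) for a discrete random variable v."""
    return float(np.sum(probs * f(x0 + t * values)))

values = np.array([-2.0, 1.0])       # zero mean: -2/3 + 2/3 = 0
probs = np.array([1.0 / 3.0, 2.0 / 3.0])
ts = np.linspace(0.0, 2.0, 21)
curve = [expected_value(np.exp, 0.5, t, values, probs) for t in ts]
print(curve[:3])  # expected to be nondecreasing in t
```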

3.11 Monotone mappings. A function $\psi : \mathbf{R}^n \to \mathbf{R}^n$ is called monotone if for all $x, y \in \mathbf{dom}\, \psi$,

$$(\psi(x) - \psi(y))^T (x - y) \ge 0.$$

(Note that 'monotone' as defined here is not the same as the definition given in §3.6.1. Both definitions are widely used.) Suppose $f : \mathbf{R}^n \to \mathbf{R}$ is a differentiable convex function. Show that its gradient $\nabla f$ is monotone. Is the converse true, i.e., is every monotone mapping the gradient of a convex function?
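To make the monotonicity condition concrete, the sketch below evaluates $(\nabla f(x) - \nabla f(y))^T (x - y)$ for the convex function $f(x) = \log \sum_i e^{x_i}$, whose gradient is the softmax vector. The function choice and the helper names are assumptions for illustration.

```python
import numpy as np

def grad_logsumexp(x):
    """Gradient of log(sum(exp(x))), computed stably."""
    w = np.exp(x - x.max())
    return w / w.sum()

def min_monotonicity_product(n=5, trials=500, seed=0):
    """Smallest value of (grad f(x) - grad f(y))^T (x - y) over samples."""
    rng = np.random.default_rng(seed)
    worst = np.inf
    for _ in range(trials):
        x, y = rng.standard_normal(n), rng.standard_normal(n)
        worst = min(worst, float((grad_logsumexp(x) - grad_logsumexp(y)) @ (x - y)))
    return worst

print(min_monotonicity_product())  # nonnegative up to rounding
```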

3.12 Suppose $f : \mathbf{R}^n \to \mathbf{R}$ is convex, $g : \mathbf{R}^n \to \mathbf{R}$ is concave, $\mathbf{dom}\, f = \mathbf{dom}\, g = \mathbf{R}^n$, and for all $x$, $g(x) \le f(x)$. Show that there exists an affine function $h$ such that for all $x$, $g(x) \le h(x) \le f(x)$. In other words, if a concave function $g$ is an underestimator of a convex function $f$, then we can fit an affine function between $f$ and $g$.

3.13 Kullback-Leibler divergence and the information inequality. Let $D_{\mathrm{kl}}$ be the Kullback-Leibler divergence, as defined in (3.17). Prove the information inequality: $D_{\mathrm{kl}}(u, v) \ge 0$ for all $u, v \in \mathbf{R}^n_{++}$. Also show that $D_{\mathrm{kl}}(u, v) = 0$ if and only if $u = v$.

Hint. The Kullback-Leibler divergence can be expressed as

$$D_{\mathrm{kl}}(u, v) = f(u) - f(v) - \nabla f(v)^T (u - v),$$

where $f(v) = \sum_{i=1}^n v_i \log v_i$ is the negative entropy of $v$.
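The identity in the hint can be checked numerically. The sketch below assumes that (3.17) is the unnormalized form $D_{\mathrm{kl}}(u, v) = \sum_i (u_i \log(u_i/v_i) - u_i + v_i)$ and compares it with the expression built from the negative entropy; the function names are illustrative.

```python
import numpy as np

def kl_divergence(u, v):
    """Sum of u_i log(u_i / v_i) - u_i + v_i for positive vectors."""
    return float(np.sum(u * np.log(u / v) - u + v))

def neg_entropy(v):
    return float(np.sum(v * np.log(v)))

def bregman_neg_entropy(u, v):
    """f(u) - f(v) - grad f(v)^T (u - v) with f the negative entropy."""
    grad_v = np.log(v) + 1.0
    return neg_entropy(u) - neg_entropy(v) - float(grad_v @ (u - v))

rng = np.random.default_rng(0)
u = rng.uniform(0.1, 2.0, size=6)
v = rng.uniform(0.1, 2.0, size=6)
print(kl_divergence(u, v), bregman_neg_entropy(u, v))
```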

3.14 Convex-concave functions and saddle-points. We say the function $f : \mathbf{R}^n \times \mathbf{R}^m \to \mathbf{R}$ is convex-concave if $f(x, z)$ is a concave function of $z$, for each fixed $x$, and a convex function of $x$, for each fixed $z$. We also require its domain to have the product form $\mathbf{dom}\, f = A \times B$, where $A \subseteq \mathbf{R}^n$ and $B \subseteq \mathbf{R}^m$ are convex.

(a) Give a second-order condition for a twice differentiable function $f : \mathbf{R}^n \times \mathbf{R}^m \to \mathbf{R}$ to be convex-concave, in terms of its Hessian $\nabla^2 f(x, z)$.

(b) Suppose that $f : \mathbf{R}^n \times \mathbf{R}^m \to \mathbf{R}$ is convex-concave and differentiable, with $\nabla f(\tilde x, \tilde z) = 0$. Show that the saddle-point property holds: for all $x$, $z$, we have

$$f(\tilde x, z) \le f(\tilde x, \tilde z) \le f(x, \tilde z).$$

Show that this implies that $f$ satisfies the strong max-min property:

$$\sup_z \inf_x f(x, z) = \inf_x \sup_z f(x, z)$$

(and their common value is $f(\tilde x, \tilde z)$).

(c) Now suppose that $f : \mathbf{R}^n \times \mathbf{R}^m \to \mathbf{R}$ is differentiable, but not necessarily convex-concave, and the saddle-point property holds at $\tilde x$, $\tilde z$:

$$f(\tilde x, z) \le f(\tilde x, \tilde z) \le f(x, \tilde z)$$

for all $x$, $z$. Show that $\nabla f(\tilde x, \tilde z) = 0$.

Examples 

3.15 A family of concave utility functions. For $0 < \alpha \le 1$ let

$$u_\alpha(x) = \frac{x^\alpha - 1}{\alpha},$$

with $\mathbf{dom}\, u_\alpha = \mathbf{R}_+$. We also define $u_0(x) = \log x$ (with $\mathbf{dom}\, u_0 = \mathbf{R}_{++}$).

(a) Show that for $x > 0$, $u_0(x) = \lim_{\alpha \to 0} u_\alpha(x)$.

(b) Show that $u_\alpha$ are concave, monotone increasing, and all satisfy $u_\alpha(1) = 0$.

These  functions  are  often  used  in  economics  to  model  the  benefit  or  utility  of  some  quantity 
of  goods  or  money.  Concavity  of  ua  means  that  the  marginal  utility  (i.e.,  the  increase 
in  utility  obtained  for  a  fixed  increase  in  the  goods)  decreases  as  the  amount  of  goods 
increases.  In  other  words,  concavity  models  the  effect  of  satiation. 
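The limit in part (a) and the normalization in part (b) can be observed numerically. This is a minimal sketch; the function name `utility` and the sample points are illustrative assumptions.

```python
import math

def utility(x, alpha):
    """u_alpha(x) = (x**alpha - 1) / alpha, with u_0(x) = log x."""
    if alpha == 0:
        return math.log(x)
    return (x ** alpha - 1.0) / alpha

for alpha in (1.0, 0.5, 0.1, 1e-3, 1e-6):
    print(alpha, utility(3.0, alpha))
print("log 3 =", utility(3.0, 0))
```

The printed values approach $\log 3$ as $\alpha$ decreases toward zero.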

3.16 For each of the following functions determine whether it is convex, concave, quasiconvex, or quasiconcave.

(a) $f(x) = e^x - 1$ on $\mathbf{R}$.

(b) $f(x_1, x_2) = x_1 x_2$ on $\mathbf{R}^2_{++}$.

(c) $f(x_1, x_2) = 1/(x_1 x_2)$ on $\mathbf{R}^2_{++}$.

(d) $f(x_1, x_2) = x_1/x_2$ on $\mathbf{R}^2_{++}$.

(e) $f(x_1, x_2) = x_1^2/x_2$ on $\mathbf{R} \times \mathbf{R}_{++}$.

(f) $f(x_1, x_2) = x_1^\alpha x_2^{1-\alpha}$, where $0 \le \alpha \le 1$, on $\mathbf{R}^2_{++}$.

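3.17 Suppose $p < 1$, $p \neq 0$. Show that the function

$$f(x) = \left( \sum_{i=1}^n x_i^p \right)^{1/p}$$

with $\mathbf{dom}\, f = \mathbf{R}^n_{++}$ is concave. This includes as special cases $f(x) = \left( \sum_{i=1}^n x_i^{1/2} \right)^2$ and the harmonic mean $f(x) = \left( \sum_{i=1}^n 1/x_i \right)^{-1}$. Hint. Adapt the proofs for the log-sum-exp function and the geometric mean in §3.1.5.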

3.18 Adapt the proof of concavity of the log-determinant function in §3.1.5 to show the following.

(a) $f(X) = \mathbf{tr}(X^{-1})$ is convex on $\mathbf{dom}\, f = \mathbf{S}^n_{++}$.

(b) $f(X) = (\det X)^{1/n}$ is concave on $\mathbf{dom}\, f = \mathbf{S}^n_{++}$.

3.19 Nonnegative weighted sums and integrals.

(a) Show that $f(x) = \sum_{i=1}^r \alpha_i x_{[i]}$ is a convex function of $x$, where $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_r \ge 0$, and $x_{[i]}$ denotes the $i$th largest component of $x$. (You can use the fact that $f(x) = \sum_{i=1}^k x_{[i]}$ is convex on $\mathbf{R}^n$.)

(b) Let $T(x, \omega)$ denote the trigonometric polynomial

$$T(x, \omega) = x_1 + x_2 \cos\omega + x_3 \cos 2\omega + \cdots + x_n \cos(n-1)\omega.$$

Show that the function

$$f(x) = -\int_0^{2\pi} \log T(x, \omega)\, d\omega$$

is convex on $\{x \in \mathbf{R}^n \mid T(x, \omega) > 0,\ 0 \le \omega \le 2\pi\}$.

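3.20 Composition with an affine function. Show that the following functions $f : \mathbf{R}^n \to \mathbf{R}$ are convex.

(a) $f(x) = \|Ax - b\|$, where $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$, and $\|\cdot\|$ is a norm on $\mathbf{R}^m$.

(b) $f(x) = -(\det(A_0 + x_1 A_1 + \cdots + x_n A_n))^{1/m}$, on $\{x \mid A_0 + x_1 A_1 + \cdots + x_n A_n \succ 0\}$, where $A_i \in \mathbf{S}^m$.

(c) $f(x) = \mathbf{tr}\,(A_0 + x_1 A_1 + \cdots + x_n A_n)^{-1}$, on $\{x \mid A_0 + x_1 A_1 + \cdots + x_n A_n \succ 0\}$, where $A_i \in \mathbf{S}^m$. (Use the fact that $\mathbf{tr}(X^{-1})$ is convex on $\mathbf{S}^m_{++}$; see exercise 3.18.)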
3.21 Pointwise maximum and supremum. Show that the following functions $f : \mathbf{R}^n \to \mathbf{R}$ are convex.

(a) $f(x) = \max_{i=1,\ldots,k} \|A^{(i)} x - b^{(i)}\|$, where $A^{(i)} \in \mathbf{R}^{m \times n}$, $b^{(i)} \in \mathbf{R}^m$ and $\|\cdot\|$ is a norm on $\mathbf{R}^m$.

(b) $f(x) = \sum_{i=1}^r |x|_{[i]}$ on $\mathbf{R}^n$, where $|x|$ denotes the vector with $|x|_i = |x_i|$ (i.e., $|x|$ is the absolute value of $x$, componentwise), and $|x|_{[i]}$ is the $i$th largest component of $|x|$. In other words, $|x|_{[1]}, |x|_{[2]}, \ldots, |x|_{[n]}$ are the absolute values of the components of $x$, sorted in nonincreasing order.

3.22 Composition rules. Show that the following functions are convex.

(a) $f(x) = -\log\left(-\log\left(\sum_{i=1}^m e^{a_i^T x + b_i}\right)\right)$ on $\mathbf{dom}\, f = \{x \mid \sum_{i=1}^m e^{a_i^T x + b_i} < 1\}$. You can use the fact that $\log\left(\sum_{i=1}^n e^{y_i}\right)$ is convex.

(b) $f(x, u, v) = -\sqrt{uv - x^T x}$ on $\mathbf{dom}\, f = \{(x, u, v) \mid uv > x^T x,\ u, v > 0\}$. Use the fact that $x^T x/u$ is convex in $(x, u)$ for $u > 0$, and that $-\sqrt{x_1 x_2}$ is convex on $\mathbf{R}^2_{++}$.

(c) $f(x, u, v) = -\log(uv - x^T x)$ on $\mathbf{dom}\, f = \{(x, u, v) \mid uv > x^T x,\ u, v > 0\}$.

(d) $f(x, t) = -(t^p - \|x\|_p^p)^{1/p}$ where $p > 1$ and $\mathbf{dom}\, f = \{(x, t) \mid t \ge \|x\|_p\}$. You can use the fact that $\|x\|_p^p/u^{p-1}$ is convex in $(x, u)$ for $u > 0$ (see exercise 3.23), and that $-x^{1/p} y^{1-1/p}$ is convex on $\mathbf{R}^2_+$ (see exercise 3.16).

(e) $f(x, t) = -\log(t^p - \|x\|_p^p)$ where $p > 1$ and $\mathbf{dom}\, f = \{(x, t) \mid t > \|x\|_p\}$. You can use the fact that $\|x\|_p^p/u^{p-1}$ is convex in $(x, u)$ for $u > 0$ (see exercise 3.23).

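3.23 Perspective of a function.

(a) Show that for $p > 1$,

$$f(x, t) = \frac{|x_1|^p + \cdots + |x_n|^p}{t^{p-1}} = \frac{\|x\|_p^p}{t^{p-1}}$$

is convex on $\{(x, t) \mid t > 0\}$.

(b) Show that

$$f(x) = \frac{\|Ax + b\|_2^2}{c^T x + d}$$

is convex on $\{x \mid c^T x + d > 0\}$, where $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$, $c \in \mathbf{R}^n$ and $d \in \mathbf{R}$.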


3.24 Some functions on the probability simplex. Let $x$ be a real-valued random variable which takes values in $\{a_1, \ldots, a_n\}$ where $a_1 < a_2 < \cdots < a_n$, with $\mathbf{prob}(x = a_i) = p_i$, $i = 1, \ldots, n$. For each of the following functions of $p$ (on the probability simplex $\{p \in \mathbf{R}^n_+ \mid \mathbf{1}^T p = 1\}$), determine if the function is convex, concave, quasiconvex, or quasiconcave.

(a) $\mathbf{E}\, x$.

(b) $\mathbf{prob}(x \ge \alpha)$.

(c) $\mathbf{prob}(\alpha \le x \le \beta)$.

(d) $\sum_{i=1}^n p_i \log p_i$, the negative entropy of the distribution.

(e) $\mathbf{var}\, x = \mathbf{E}(x - \mathbf{E}\, x)^2$.

(f) $\mathbf{quartile}(x) = \inf\{\beta \mid \mathbf{prob}(x \le \beta) \ge 0.25\}$.

(g) The cardinality of the smallest set $A \subseteq \{a_1, \ldots, a_n\}$ with probability $\ge 90\%$. (By cardinality we mean the number of elements in $A$.)

(h) The minimum width interval that contains 90% of the probability, i.e.,

$$\inf\{\beta - \alpha \mid \mathbf{prob}(\alpha \le x \le \beta) \ge 0.9\}.$$
3.25 Maximum probability distance between distributions. Let $p, q \in \mathbf{R}^n$ represent two probability distributions on $\{1, \ldots, n\}$ (so $p, q \succeq 0$, $\mathbf{1}^T p = \mathbf{1}^T q = 1$). We define the maximum probability distance $d_{\mathrm{mp}}(p, q)$ between $p$ and $q$ as the maximum difference in probability assigned by $p$ and $q$, over all events:

$$d_{\mathrm{mp}}(p, q) = \max\{|\mathbf{prob}(p, C) - \mathbf{prob}(q, C)| \mid C \subseteq \{1, \ldots, n\}\}.$$

Here $\mathbf{prob}(p, C)$ is the probability of $C$, under the distribution $p$, i.e., $\mathbf{prob}(p, C) = \sum_{i \in C} p_i$.

Find a simple expression for $d_{\mathrm{mp}}$, involving $\|p - q\|_1 = \sum_{i=1}^n |p_i - q_i|$, and show that $d_{\mathrm{mp}}$ is a convex function on $\mathbf{R}^n \times \mathbf{R}^n$. (Its domain is $\{(p, q) \mid p, q \succeq 0,\ \mathbf{1}^T p = \mathbf{1}^T q = 1\}$, but it has a natural extension to all of $\mathbf{R}^n \times \mathbf{R}^n$.)
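For small $n$ the definition of $d_{\mathrm{mp}}$ can be evaluated by brute force over all events, which gives a way to test any candidate closed form. The sketch below compares the brute-force value with $\tfrac{1}{2}\|p - q\|_1$, the standard total variation identity, used here as the candidate; the helper name and the random test distributions are assumptions.

```python
from itertools import combinations
import numpy as np

def max_prob_distance(p, q):
    """Brute-force maximum of |prob(p, C) - prob(q, C)| over nonempty C."""
    n = len(p)
    best = 0.0  # the empty event contributes 0
    for r in range(1, n + 1):
        for C in combinations(range(n), r):
            idx = list(C)
            best = max(best, abs(float(p[idx].sum() - q[idx].sum())))
    return best

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(5))
brute = max_prob_distance(p, q)
candidate = 0.5 * float(np.abs(p - q).sum())
print(brute, candidate)
```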

3.26 More functions of eigenvalues. Let $\lambda_1(X) \ge \lambda_2(X) \ge \cdots \ge \lambda_n(X)$ denote the eigenvalues of a matrix $X \in \mathbf{S}^n$. We have already seen several functions of the eigenvalues that are convex or concave functions of $X$.

• The maximum eigenvalue $\lambda_1(X)$ is convex (example 3.10). The minimum eigenvalue $\lambda_n(X)$ is concave.

• The sum of the eigenvalues (or trace), $\mathbf{tr}\, X = \lambda_1(X) + \cdots + \lambda_n(X)$, is linear.

• The sum of the inverses of the eigenvalues (or trace of the inverse), $\mathbf{tr}(X^{-1}) = \sum_{i=1}^n 1/\lambda_i(X)$, is convex on $\mathbf{S}^n_{++}$ (exercise 3.18).

• The geometric mean of the eigenvalues, $(\det X)^{1/n} = \left(\prod_{i=1}^n \lambda_i(X)\right)^{1/n}$, and the logarithm of the product of the eigenvalues, $\log\det X = \sum_{i=1}^n \log \lambda_i(X)$, are concave on $X \in \mathbf{S}^n_{++}$ (exercise 3.18 and page 74).

In this problem we explore some more functions of eigenvalues, by exploiting variational characterizations.

(a) Sum of $k$ largest eigenvalues. Show that $\sum_{i=1}^k \lambda_i(X)$ is convex on $\mathbf{S}^n$. Hint. [HJ85, page 191] Use the variational characterization

$$\sum_{i=1}^k \lambda_i(X) = \sup\{\mathbf{tr}(V^T X V) \mid V \in \mathbf{R}^{n \times k},\ V^T V = I\}.$$

(b) Geometric mean of $k$ smallest eigenvalues. Show that $\left(\prod_{i=n-k+1}^n \lambda_i(X)\right)^{1/k}$ is concave on $\mathbf{S}^n_{++}$. Hint. [MO79, page 513] For $X \succ 0$, we have

$$\left(\prod_{i=n-k+1}^n \lambda_i(X)\right)^{1/k} = \frac{1}{k} \inf\{\mathbf{tr}(V^T X V) \mid V \in \mathbf{R}^{n \times k},\ \det V^T V = 1\}.$$

(c) Log of product of $k$ smallest eigenvalues. Show that $\sum_{i=n-k+1}^n \log \lambda_i(X)$ is concave on $\mathbf{S}^n_{++}$. Hint. [MO79, page 513] For $X \succ 0$,

$$\prod_{i=n-k+1}^n \lambda_i(X) = \inf\left\{ \prod_{i=1}^k (V^T X V)_{ii} \;\middle|\; V \in \mathbf{R}^{n \times k},\ V^T V = I \right\}.$$
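The variational characterization in part (a) can be explored numerically: the top-$k$ eigenvectors attain the sum of the $k$ largest eigenvalues, and other matrices with orthonormal columns should not exceed it. This is a sketch on one random symmetric matrix; all names and the random data are illustrative assumptions.

```python
import numpy as np

def sum_k_largest_eigs(X, k):
    return float(np.sort(np.linalg.eigvalsh(X))[-k:].sum())

def trace_form(X, V):
    """tr(V^T X V)."""
    return float(np.trace(V.T @ X @ V))

def random_orthonormal(n, k, rng):
    """n-by-k matrix with orthonormal columns (reduced QR factor)."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    return Q

rng = np.random.default_rng(0)
n, k = 6, 3
M = rng.standard_normal((n, n))
X = (M + M.T) / 2
_, U = np.linalg.eigh(X)            # eigenvalues in ascending order
best = trace_form(X, U[:, -k:])     # top-k eigenvectors
others = [trace_form(X, random_orthonormal(n, k, rng)) for _ in range(500)]
print(best, sum_k_largest_eigs(X, k), max(others))
```

Since the left-hand side is a supremum of functions that are linear in $X$, this is the structure the hint relies on.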

3.27 Diagonal elements of Cholesky factor. Each $X \in \mathbf{S}^n_{++}$ has a unique Cholesky factorization $X = LL^T$, where $L$ is lower triangular, with $L_{ii} > 0$. Show that $L_{ii}$ is a concave function of $X$ (with domain $\mathbf{S}^n_{++}$).

Hint. $L_{ii}$ can be expressed as $L_{ii} = (w - z^T Y^{-1} z)^{1/2}$, where

$$\begin{bmatrix} Y & z \\ z^T & w \end{bmatrix}$$

is the leading $i \times i$ submatrix of $X$.
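The formula in the hint can be compared against a library Cholesky factorization. The sketch below uses `numpy.linalg.cholesky`, which returns the lower triangular factor; the helper name and the random positive definite test matrix are assumptions for illustration.

```python
import numpy as np

def cholesky_diagonal_via_schur(X):
    """Diagonal of the Cholesky factor from L_ii = sqrt(w - z^T Y^{-1} z)."""
    n = X.shape[0]
    d = np.empty(n)
    for i in range(n):
        Y, z, w = X[:i, :i], X[:i, i], X[i, i]
        # For i = 0 the block Y is empty and the correction term is zero.
        correction = float(z @ np.linalg.solve(Y, z)) if i > 0 else 0.0
        d[i] = np.sqrt(w - correction)
    return d

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
X = M @ M.T + np.eye(5)              # positive definite
print(cholesky_diagonal_via_schur(X))
print(np.diag(np.linalg.cholesky(X)))
```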
Operations that preserve convexity

3.28 Expressing a convex function as the pointwise supremum of a family of affine functions. In this problem we extend the result proved on page 83 to the case where $\mathbf{dom}\, f \neq \mathbf{R}^n$. Let $f : \mathbf{R}^n \to \mathbf{R}$ be a convex function. Define $\tilde f : \mathbf{R}^n \to \mathbf{R}$ as the pointwise supremum of all affine functions that are global underestimators of $f$:

$$\tilde f(x) = \sup\{g(x) \mid g \text{ affine},\ g(z) \le f(z) \text{ for all } z\}.$$

(a) Show that $f(x) = \tilde f(x)$ for $x \in \mathbf{int}\, \mathbf{dom}\, f$.

(b) Show that $f = \tilde f$ if $f$ is closed (i.e., $\mathbf{epi}\, f$ is a closed set; see §A.3.3).

3.29 Representation of piecewise-linear convex functions. A function $f : \mathbf{R}^n \to \mathbf{R}$, with $\mathbf{dom}\, f = \mathbf{R}^n$, is called piecewise-linear if there exists a partition of $\mathbf{R}^n$ as

$$\mathbf{R}^n = X_1 \cup X_2 \cup \cdots \cup X_L,$$

where $\mathbf{int}\, X_i \neq \emptyset$ and $\mathbf{int}\, X_i \cap \mathbf{int}\, X_j = \emptyset$ for $i \neq j$, and a family of affine functions $a_1^T x + b_1, \ldots, a_L^T x + b_L$ such that $f(x) = a_i^T x + b_i$ for $x \in X_i$.

Show that this means that $f(x) = \max\{a_1^T x + b_1, \ldots, a_L^T x + b_L\}$.

3.30 Convex hull or envelope of a function. The convex hull or convex envelope of a function $f : \mathbf{R}^n \to \mathbf{R}$ is defined as

$$g(x) = \inf\{t \mid (x, t) \in \mathbf{conv}\, \mathbf{epi}\, f\}.$$

Geometrically, the epigraph of $g$ is the convex hull of the epigraph of $f$.

Show that $g$ is the largest convex underestimator of $f$. In other words, show that if $h$ is convex and satisfies $h(x) \le f(x)$ for all $x$, then $h(x) \le g(x)$ for all $x$.


3.31 [Roc70, page 35] Largest homogeneous underestimator. Let $f$ be a convex function. Define the function $g$ as

$$g(x) = \inf_{\alpha > 0} \frac{f(\alpha x)}{\alpha}.$$

(a) Show that $g$ is homogeneous ($g(tx) = t g(x)$ for all $t \ge 0$).

(b) Show that $g$ is the largest homogeneous underestimator of $f$: If $h$ is homogeneous and $h(x) \le f(x)$ for all $x$, then we have $h(x) \le g(x)$ for all $x$.

(c) Show that $g$ is convex.

3.32  Products  and  ratios  of  convex  functions.  In  general  the  product  or  ratio  of  two  convex 
functions  is  not  convex.  However,  there  are  some  results  that  apply  to  functions  on  R. 
Prove  the  following. 

(a)  If  /  and  g  are  convex,  both  nondecreasing  (or  nonincreasing),  and  positive  functions 
on  an  interval,  then  fg  is  convex. 

(b)  If  /,  g  are  concave,  positive,  with  one  nondecreasing  and  the  other  nonincreasing, 
then  fg  is  concave. 

(c)  If  /  is  convex,  nondecreasing,  and  positive,  and  g  is  concave,  nonincreasing,  and 
positive,  then  f/g  is  convex. 

3.33 Direct proof of perspective theorem. Give a direct proof that the perspective function $g$, as defined in §3.2.6, of a convex function $f$ is convex: Show that $\mathbf{dom}\, g$ is a convex set, and that for $(x, t), (y, s) \in \mathbf{dom}\, g$, and $0 \le \theta \le 1$, we have

$$g(\theta x + (1 - \theta) y,\ \theta t + (1 - \theta) s) \le \theta g(x, t) + (1 - \theta) g(y, s).$$


3.34 The Minkowski function. The Minkowski function of a convex set $C$ is defined as

$$M_C(x) = \inf\{t > 0 \mid t^{-1} x \in C\}.$$

(a) Draw a picture giving a geometric interpretation of how to find $M_C(x)$.

(b) Show that $M_C$ is homogeneous, i.e., $M_C(\alpha x) = \alpha M_C(x)$ for $\alpha \ge 0$.

(c) What is $\mathbf{dom}\, M_C$?

(d) Show that $M_C$ is a convex function.

(e) Suppose $C$ is also closed, bounded, symmetric (if $x \in C$ then $-x \in C$), and has nonempty interior. Show that $M_C$ is a norm. What is the corresponding unit ball?

3.35 Support function calculus. Recall that the support function of a set $C \subseteq \mathbf{R}^n$ is defined as $S_C(y) = \sup\{y^T x \mid x \in C\}$. On page 81 we showed that $S_C$ is a convex function.

(a) Show that $S_B = S_{\mathbf{conv}\, B}$.

(b) Show that $S_{A+B} = S_A + S_B$.

(c) Show that $S_{A \cup B} = \max\{S_A, S_B\}$.

(d) Let $B$ be closed and convex. Show that $A \subseteq B$ if and only if $S_A(y) \le S_B(y)$ for all $y$.

Conjugate  functions 

3.36 Derive the conjugates of the following functions.

(a) Max function. $f(x) = \max_{i=1,\ldots,n} x_i$ on $\mathbf{R}^n$.

(b) Sum of largest elements. $f(x) = \sum_{i=1}^r x_{[i]}$ on $\mathbf{R}^n$.

(c) Piecewise-linear function on $\mathbf{R}$. $f(x) = \max_{i=1,\ldots,m}(a_i x + b_i)$ on $\mathbf{R}$. You can assume that the $a_i$ are sorted in increasing order, i.e., $a_1 \le \cdots \le a_m$, and that none of the functions $a_i x + b_i$ is redundant, i.e., for each $k$ there is at least one $x$ with $f(x) = a_k x + b_k$.

(d) Power function. $f(x) = x^p$ on $\mathbf{R}_{++}$, where $p > 1$. Repeat for $p < 0$.

(e) Negative geometric mean. $f(x) = -\left(\prod x_i\right)^{1/n}$ on $\mathbf{R}^n_{++}$.

(f) Negative generalized logarithm for second-order cone. $f(x, t) = -\log(t^2 - x^T x)$ on $\{(x, t) \in \mathbf{R}^n \times \mathbf{R} \mid \|x\|_2 < t\}$.

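3.37 Show that the conjugate of $f(X) = \mathbf{tr}(X^{-1})$ with $\mathbf{dom}\, f = \mathbf{S}^n_{++}$ is given by

$$f^*(Y) = -2\, \mathbf{tr}\,(-Y)^{1/2}, \qquad \mathbf{dom}\, f^* = -\mathbf{S}^n_+.$$

Hint. The gradient of $f$ is $\nabla f(X) = -X^{-2}$.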

3.38 Young's inequality. Let $f : \mathbf{R} \to \mathbf{R}$ be an increasing function, with $f(0) = 0$, and let $g$ be its inverse. Define $F$ and $G$ as

$$F(x) = \int_0^x f(a)\, da, \qquad G(y) = \int_0^y g(a)\, da.$$

Show that $F$ and $G$ are conjugates. Give a simple graphical interpretation of Young's inequality,

$$xy \le F(x) + G(y).$$
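Young's inequality can be tried out on a concrete pair. In the sketch below the choice $f(a) = a^3$ is an assumption for illustration; its inverse is the cube root, so $F(x) = x^4/4$ and $G(y) = \tfrac{3}{4}|y|^{4/3}$, and equality is expected when $y = f(x)$.

```python
import numpy as np

def F(x):
    """Integral of f(a) = a**3 from 0 to x."""
    return x ** 4 / 4.0

def G(y):
    """Integral of the cube root from 0 to y."""
    return 0.75 * np.abs(y) ** (4.0 / 3.0)

def young_gap(x, y):
    """F(x) + G(y) - x*y, which should be nonnegative."""
    return F(x) + G(y) - x * y

rng = np.random.default_rng(0)
xs = rng.uniform(-3.0, 3.0, size=1000)
ys = rng.uniform(-3.0, 3.0, size=1000)
print(young_gap(xs, ys).min())       # nonnegative up to rounding
print(young_gap(2.0, 2.0 ** 3))      # close to zero: y = f(x)
```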

3.39 Properties of conjugate functions.

(a) Conjugate of convex plus affine function. Define $g(x) = f(x) + c^T x + d$, where $f$ is convex. Express $g^*$ in terms of $f^*$ (and $c$, $d$).

(b) Conjugate of perspective. Express the conjugate of the perspective of a convex function $f$ in terms of $f^*$.

(c) Conjugate and minimization. Let $f(x, z)$ be convex in $(x, z)$ and define $g(x) = \inf_z f(x, z)$. Express the conjugate $g^*$ in terms of $f^*$.

As an application, express the conjugate of $g(x) = \inf_z \{h(z) \mid Az + b = x\}$, where $h$ is convex, in terms of $h^*$, $A$, and $b$.

(d) Conjugate of conjugate. Show that the conjugate of the conjugate of a closed convex function is itself: $f = f^{**}$ if $f$ is closed and convex. (A function is closed if its epigraph is closed; see §A.3.3.) Hint. Show that $f^{**}$ is the pointwise supremum of all affine global underestimators of $f$. Then apply the result of exercise 3.28.

3.40 Gradient and Hessian of conjugate function. Suppose $f : \mathbf{R}^n \to \mathbf{R}$ is convex and twice continuously differentiable. Suppose $\bar y$ and $\bar x$ are related by $\bar y = \nabla f(\bar x)$, and that $\nabla^2 f(\bar x) \succ 0$.

(a) Show that $\nabla f^*(\bar y) = \bar x$.

(b) Show that $\nabla^2 f^*(\bar y) = \nabla^2 f(\bar x)^{-1}$.


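3.41 Conjugate of negative normalized entropy. Show that the conjugate of the negative normalized entropy

$$f(x) = \sum_{i=1}^n x_i \log(x_i / \mathbf{1}^T x),$$

with $\mathbf{dom}\, f = \mathbf{R}^n_{++}$, is given by

$$f^*(y) = \begin{cases} 0 & \sum_{i=1}^n e^{y_i} \le 1 \\ +\infty & \text{otherwise.} \end{cases}$$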


Quasiconvex  functions 

3.42 Approximation width. Let $f_0, \ldots, f_n : \mathbf{R} \to \mathbf{R}$ be given continuous functions. We consider the problem of approximating $f_0$ as a linear combination of $f_1, \ldots, f_n$. For $x \in \mathbf{R}^n$, we say that $f = x_1 f_1 + \cdots + x_n f_n$ approximates $f_0$ with tolerance $\epsilon > 0$ over the interval $[0, T]$ if $|f(t) - f_0(t)| \le \epsilon$ for $0 \le t \le T$. Now we choose a fixed tolerance $\epsilon > 0$ and define the approximation width as the largest $T$ such that $f$ approximates $f_0$ over the interval $[0, T]$:

$$W(x) = \sup\{T \mid |x_1 f_1(t) + \cdots + x_n f_n(t) - f_0(t)| \le \epsilon \text{ for } 0 \le t \le T\}.$$

Show that $W$ is quasiconcave.

3.43 First-order condition for quasiconvexity. Prove the first-order condition for quasiconvexity given in §3.4.3: A differentiable function $f : \mathbf{R}^n \to \mathbf{R}$, with $\mathbf{dom}\, f$ convex, is quasiconvex if and only if for all $x, y \in \mathbf{dom}\, f$,

$$f(y) \le f(x) \implies \nabla f(x)^T (y - x) \le 0.$$

Hint. It suffices to prove the result for a function on $\mathbf{R}$; the general result follows by restriction to an arbitrary line.

3.44 Second-order conditions for quasiconvexity. In this problem we derive alternate representations of the second-order conditions for quasiconvexity given in §3.4.3. Prove the following.

(a) A point $x \in \mathbf{dom}\, f$ satisfies (3.21) if and only if there exists a $\sigma$ such that

$$\nabla^2 f(x) + \sigma \nabla f(x) \nabla f(x)^T \succeq 0. \qquad (3.26)$$

It satisfies (3.22) for all $y \neq 0$ if and only if there exists a $\sigma$ such that

$$\nabla^2 f(x) + \sigma \nabla f(x) \nabla f(x)^T \succ 0. \qquad (3.27)$$

Hint. We can assume without loss of generality that $\nabla^2 f(x)$ is diagonal.

(b) A point $x \in \mathbf{dom}\, f$ satisfies (3.21) if and only if either $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$, or $\nabla f(x) \neq 0$ and the matrix

$$H(x) = \begin{bmatrix} \nabla^2 f(x) & \nabla f(x) \\ \nabla f(x)^T & 0 \end{bmatrix}$$

has exactly one negative eigenvalue. It satisfies (3.22) for all $y \neq 0$ if and only if $H(x)$ has exactly one nonpositive eigenvalue.

Hint. You can use the result of part (a). The following result, which follows from the eigenvalue interlacing theorem in linear algebra, may also be useful: If $B \in \mathbf{S}^n$ and $a \in \mathbf{R}^n$, then

$$\lambda_n\left( \begin{bmatrix} B & a \\ a^T & 0 \end{bmatrix} \right) \ge \lambda_n(B).$$


3.45 Use the first and second-order conditions for quasiconvexity given in §3.4.3 to verify quasiconvexity of the function $f(x) = -x_1 x_2$, with $\mathbf{dom}\, f = \mathbf{R}^2_{++}$.

3.46 Quasilinear functions with domain $\mathbf{R}^n$. A function on $\mathbf{R}$ that is quasilinear (i.e., quasiconvex and quasiconcave) is monotone, i.e., either nondecreasing or nonincreasing. In this problem we consider a generalization of this result to functions on $\mathbf{R}^n$.

Suppose the function $f : \mathbf{R}^n \to \mathbf{R}$ is quasilinear and continuous with $\mathbf{dom}\, f = \mathbf{R}^n$. Show that it can be expressed as $f(x) = g(a^T x)$, where $g : \mathbf{R} \to \mathbf{R}$ is monotone and $a \in \mathbf{R}^n$. In other words, a quasilinear function with domain $\mathbf{R}^n$ must be a monotone function of a linear function. (The converse is also true.)


Log-concave  and  log-convex  functions 

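3.47 Suppose $f : \mathbf{R}^n \to \mathbf{R}$ is differentiable, $\mathbf{dom}\, f$ is convex, and $f(x) > 0$ for all $x \in \mathbf{dom}\, f$. Show that $f$ is log-concave if and only if for all $x, y \in \mathbf{dom}\, f$,

$$\frac{f(y)}{f(x)} \le \exp\left( \frac{\nabla f(x)^T (y - x)}{f(x)} \right).$$

3.48 Show that if $f : \mathbf{R}^n \to \mathbf{R}$ is log-concave and $a \ge 0$, then the function $g = f - a$ is log-concave, where $\mathbf{dom}\, g = \{x \in \mathbf{dom}\, f \mid f(x) > a\}$.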

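3.49 Show that the following functions are log-concave.

(a) Logistic function: $f(x) = e^x/(1 + e^x)$ with $\mathbf{dom}\, f = \mathbf{R}$.

(b) Harmonic mean:

$$f(x) = \frac{1}{1/x_1 + \cdots + 1/x_n}, \qquad \mathbf{dom}\, f = \mathbf{R}^n_{++}.$$

(c) Product over sum:

$$f(x) = \frac{\prod_{i=1}^n x_i}{\sum_{i=1}^n x_i}, \qquad \mathbf{dom}\, f = \mathbf{R}^n_{++}.$$

(d) Determinant over trace:

$$f(X) = \frac{\det X}{\mathbf{tr}\, X}, \qquad \mathbf{dom}\, f = \mathbf{S}^n_{++}.$$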
3.50 Coefficients of a polynomial as a function of the roots. Show that the coefficients of a polynomial with real negative roots are log-concave functions of the roots. In other words, the functions $a_i : \mathbf{R}^n \to \mathbf{R}$, defined by the identity

$$s^n + a_1(\lambda) s^{n-1} + \cdots + a_{n-1}(\lambda) s + a_n(\lambda) = (s - \lambda_1)(s - \lambda_2) \cdots (s - \lambda_n),$$

are log-concave on $-\mathbf{R}^n_{++}$.
Hint. The function

$$S_k(x) = \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} x_{i_1} x_{i_2} \cdots x_{i_k},$$

with $\mathbf{dom}\, S_k = \mathbf{R}^n_+$ and $1 \le k \le n$, is called the $k$th elementary symmetric function on $\mathbf{R}^n$. It can be shown that $S_k^{1/k}$ is concave (see [ML57]).

3.51  [BL00,  page  41]  Let  p  be  a  polynomial  on  R,  with  all  its  roots  real.  Show  that  it  is 
log-concave  on  any  interval  on  which  it  is  positive. 

3.52 [MO79, §3.E.2] Log-convexity of moment functions. Suppose $f : \mathbf{R} \to \mathbf{R}$ is nonnegative with $\mathbf{R}_+ \subseteq \mathbf{dom}\, f$. For $x \ge 0$ define

$$\phi(x) = \int_0^\infty u^x f(u)\, du.$$

Show that $\phi$ is a log-convex function. (If $x$ is a positive integer, and $f$ is a probability density function, then $\phi(x)$ is the $x$th moment of the distribution.)

Use this to show that the Gamma function,

$$\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\, du,$$

is log-convex for $x \ge 1$.
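Log-convexity of the Gamma function can be spot-checked with the standard library function `math.lgamma`, which returns $\log|\Gamma(x)|$. The sketch below tests the midpoint inequality on a grid in $[1, 10]$; the grid and the helper name are illustrative assumptions.

```python
import math

def log_gamma_midpoint_gap(x, y):
    """(log Gamma(x) + log Gamma(y))/2 - log Gamma((x + y)/2)."""
    return 0.5 * (math.lgamma(x) + math.lgamma(y)) - math.lgamma(0.5 * (x + y))

grid = [1.0 + 0.25 * i for i in range(37)]   # 1.0, 1.25, ..., 10.0
smallest = min(log_gamma_midpoint_gap(x, y) for x in grid for y in grid)
print(smallest)  # nonnegative up to rounding for a log-convex function
```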

3.53 Suppose $x$ and $y$ are independent random vectors in $\mathbf{R}^n$, with log-concave probability density functions $f$ and $g$, respectively. Show that the probability density function of the sum $z = x + y$ is log-concave.

3.54 Log-concavity of Gaussian cumulative distribution function. The cumulative distribution function of a Gaussian random variable,

$$f(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2}\, dt,$$

is log-concave. This follows from the general result that the convolution of two log-concave functions is log-concave. In this problem we guide you through a simple self-contained proof that $f$ is log-concave. Recall that $f$ is log-concave if and only if $f''(x) f(x) \le f'(x)^2$ for all $x$.

(a) Verify that $f''(x) f(x) \le f'(x)^2$ for $x \ge 0$. That leaves us the hard part, which is to show the inequality for $x < 0$.

(b) Verify that for any $t$ and $x$ we have $t^2/2 \ge -x^2/2 + xt$.

(c) Using part (b) show that $e^{-t^2/2} \le e^{x^2/2 - xt}$. Conclude that, for $x < 0$,

$$\int_{-\infty}^x e^{-t^2/2}\, dt \le e^{x^2/2} \int_{-\infty}^x e^{-xt}\, dt.$$

(d) Use part (c) to verify that $f''(x) f(x) \le f'(x)^2$ for $x \le 0$.
3.55 Log-concavity of the cumulative distribution function of a log-concave probability density.
In this problem we extend the result of exercise 3.54. Let $g(t) = \exp(-h(t))$ be a
differentiable log-concave probability density function, and let

$$f(x) = \int_{-\infty}^x g(t)\, dt = \int_{-\infty}^x e^{-h(t)}\, dt$$

be its cumulative distribution. We will show that $f$ is log-concave, i.e., it satisfies
$f''(x) f(x) \le (f'(x))^2$ for all $x$.

(a) Express the derivatives of $f$ in terms of the function $h$. Verify that $f''(x) f(x) \le
(f'(x))^2$ if $h'(x) \ge 0$.

(b) Assume that $h'(x) < 0$. Use the inequality

$$h(t) \ge h(x) + h'(x)(t - x)$$

(which follows from convexity of $h$), to show that

$$\int_{-\infty}^x e^{-h(t)}\, dt \le \frac{e^{-h(x)}}{-h'(x)}.$$

Use this inequality to verify that $f''(x) f(x) \le (f'(x))^2$ if $h'(x) < 0$.


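3.56 More log-concave densities. Show that the following densities are log-concave.

(a) [MO79, page 493] The gamma density, defined by

$$f(x) = \frac{\alpha^\lambda}{\Gamma(\lambda)}\, x^{\lambda - 1} e^{-\alpha x},$$

with $\mathbf{dom}\, f = \mathbf{R}_+$. The parameters $\lambda$ and $\alpha$ satisfy $\lambda \ge 1$, $\alpha > 0$.

(b) [MO79, page 306] The Dirichlet density

$$f(x) = \frac{\Gamma(\mathbf{1}^T \lambda)}{\Gamma(\lambda_1) \cdots \Gamma(\lambda_{n+1})}\, x_1^{\lambda_1 - 1} \cdots x_n^{\lambda_n - 1} \left( 1 - \sum_{i=1}^n x_i \right)^{\lambda_{n+1} - 1}$$

with $\mathbf{dom}\, f = \{x \in \mathbf{R}^n_{++} \mid \mathbf{1}^T x < 1\}$. The parameter $\lambda$ satisfies $\lambda \succeq \mathbf{1}$.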


Convexity  with  respect  to  a  generalized  inequality 

3.57 Show that the function $f(X) = X^{-1}$ is matrix convex on $\mathbf{S}^n_{++}$.

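3.58 Schur complement. Suppose $X \in \mathbf{S}^n$ is partitioned as

$$X = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix},$$

where $A \in \mathbf{S}^k$. The Schur complement of $X$ (with respect to $A$) is $S = C - B^T A^{-1} B$
(see §A.5.5). Show that the Schur complement, viewed as a function from $\mathbf{S}^n$ into $\mathbf{S}^{n-k}$,
is matrix concave on $\mathbf{S}^n_{++}$.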

3.59 Second-order conditions for $K$-convexity. Let $K \subseteq \mathbf{R}^m$ be a proper convex cone, with
associated generalized inequality $\preceq_K$. Show that a twice differentiable function $f : \mathbf{R}^n \to
\mathbf{R}^m$, with convex domain, is $K$-convex if and only if for all $x \in \mathbf{dom}\, f$ and all $y \in \mathbf{R}^n$,

$$\sum_{i,j=1}^n \frac{\partial^2 f(x)}{\partial x_i \partial x_j}\, y_i y_j \succeq_K 0,$$

i.e., the second derivative is a $K$-nonnegative bilinear form. (Here $\partial^2 f / \partial x_i \partial x_j \in \mathbf{R}^m$,
with components $\partial^2 f_k / \partial x_i \partial x_j$, for $k = 1, \ldots, m$; see §A.4.1.)
3.60 Sublevel sets and epigraph of $K$-convex functions. Let $K \subseteq \mathbf{R}^m$ be a proper convex cone
with associated generalized inequality $\preceq_K$, and let $f : \mathbf{R}^n \to \mathbf{R}^m$. For $\alpha \in \mathbf{R}^m$, the
$\alpha$-sublevel set of $f$ (with respect to $\preceq_K$) is defined as

$$C_\alpha = \{x \in \mathbf{R}^n \mid f(x) \preceq_K \alpha\}.$$

The epigraph of $f$, with respect to $\preceq_K$, is defined as the set

$$\mathbf{epi}_K f = \{(x, t) \in \mathbf{R}^{n+m} \mid f(x) \preceq_K t\}.$$

Show the following:

(a) If $f$ is $K$-convex, then its sublevel sets $C_\alpha$ are convex for all $\alpha$.

(b) $f$ is $K$-convex if and only if $\mathbf{epi}_K f$ is a convex set.


Chapter  4 


Convex  optimization  problems 


4.1 Optimization problems

4.1.1 Basic terminology

We  use  the  notation 

$$\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p
\end{array} \qquad (4.1)$$

to describe the problem of finding an $x$ that minimizes $f_0(x)$ among all $x$ that satisfy
the conditions $f_i(x) \le 0$, $i = 1, \ldots, m$, and $h_i(x) = 0$, $i = 1, \ldots, p$. We call $x \in \mathbf{R}^n$
the optimization variable and the function $f_0 : \mathbf{R}^n \to \mathbf{R}$ the objective function or
cost function. The inequalities $f_i(x) \le 0$ are called inequality constraints, and the
corresponding functions $f_i : \mathbf{R}^n \to \mathbf{R}$ are called the inequality constraint functions.
The equations $h_i(x) = 0$ are called the equality constraints, and the functions
$h_i : \mathbf{R}^n \to \mathbf{R}$ are the equality constraint functions. If there are no constraints (i.e.,
$m = p = 0$) we say the problem (4.1) is unconstrained.

The set of points for which the objective and all constraint functions are defined,

$$\mathcal{D} = \bigcap_{i=0}^m \mathbf{dom}\, f_i \;\cap\; \bigcap_{i=1}^p \mathbf{dom}\, h_i,$$

is called the domain of the optimization problem (4.1). A point $x \in \mathcal{D}$ is feasible
if it satisfies the constraints $f_i(x) \le 0$, $i = 1, \ldots, m$, and $h_i(x) = 0$, $i = 1, \ldots, p$.
The  problem  (4.1)  is  said  to  be  feasible  if  there  exists  at  least  one  feasible  point, 
and  infeasible  otherwise.  The  set  of  all  feasible  points  is  called  the  feasible  set  or 
the  constraint  set. 

The optimal value $p^\star$ of the problem (4.1) is defined as

$$p^\star = \inf \{ f_0(x) \mid f_i(x) \le 0, \; i = 1, \ldots, m, \; h_i(x) = 0, \; i = 1, \ldots, p \}.$$

We allow $p^\star$ to take on the extended values $\pm\infty$. If the problem is infeasible, we
have $p^\star = \infty$ (following the standard convention that the infimum of the empty set
is $\infty$). If there are feasible points $x_k$ with $f_0(x_k) \to -\infty$ as $k \to \infty$, then $p^\star = -\infty$,
and we say the problem (4.1) is unbounded below.

Optimal  and  locally  optimal  points 

We say $x^\star$ is an optimal point, or solves the problem (4.1), if $x^\star$ is feasible and
$f_0(x^\star) = p^\star$. The set of all optimal points is the optimal set, denoted

$$X_{\mathrm{opt}} = \{ x \mid f_i(x) \le 0, \; i = 1, \ldots, m, \; h_i(x) = 0, \; i = 1, \ldots, p, \; f_0(x) = p^\star \}.$$

If there exists an optimal point for the problem (4.1), we say the optimal value
is attained or achieved, and the problem is solvable. If $X_{\mathrm{opt}}$ is empty, we say
the optimal value is not attained or not achieved. (This always occurs when the
problem is unbounded below.) A feasible point $x$ with $f_0(x) \le p^\star + \epsilon$ (where
$\epsilon > 0$) is called $\epsilon$-suboptimal, and the set of all $\epsilon$-suboptimal points is called the
$\epsilon$-suboptimal set for the problem (4.1).

We say a feasible point $x$ is locally optimal if there is an $R > 0$ such that

$$f_0(x) = \inf \{ f_0(z) \mid f_i(z) \le 0, \; i = 1, \ldots, m, \; h_i(z) = 0, \; i = 1, \ldots, p, \; \|z - x\|_2 \le R \},$$

or, in other words, $x$ solves the optimization problem

$$\begin{array}{ll}
\text{minimize} & f_0(z) \\
\text{subject to} & f_i(z) \le 0, \quad i = 1, \ldots, m \\
& h_i(z) = 0, \quad i = 1, \ldots, p \\
& \|z - x\|_2 \le R
\end{array}$$

with variable $z$. Roughly speaking, this means $x$ minimizes $f_0$ over nearby points
in the feasible set. The term 'globally optimal' is sometimes used for 'optimal'
to distinguish between 'locally optimal' and 'optimal'. Throughout this book,
however, optimal will mean globally optimal.

If $x$ is feasible and $f_i(x) = 0$, we say the $i$th inequality constraint $f_i(x) \le 0$ is
active at $x$. If $f_i(x) < 0$, we say the constraint $f_i(x) \le 0$ is inactive. (The equality
constraints are active at all feasible points.) We say that a constraint is redundant
if deleting it does not change the feasible set.


Example 4.1 We illustrate these definitions with a few simple unconstrained
optimization problems with variable $x \in \mathbf{R}$, and $\mathbf{dom}\, f_0 = \mathbf{R}_{++}$.

• $f_0(x) = 1/x$: $p^\star = 0$, but the optimal value is not achieved.

• $f_0(x) = -\log x$: $p^\star = -\infty$, so this problem is unbounded below.

• $f_0(x) = x \log x$: $p^\star = -1/e$, achieved at the (unique) optimal point $x^\star = 1/e$.
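
The three cases of example 4.1 can be checked numerically. The following is a minimal sketch, assuming a simple grid search on $(0, 2]$ is accurate enough for illustration; the function names and the grid are our own and not part of the example.

```python
import math

def f_inv(x):       # f0(x) = 1/x
    return 1.0 / x

def f_neglog(x):    # f0(x) = -log x
    return -math.log(x)

def f_xlogx(x):     # f0(x) = x log x
    return x * math.log(x)

# 1/x decreases toward p* = 0 as x grows, but stays strictly positive.
print([f_inv(10.0 ** k) for k in (1, 3, 6)])

# -log x decreases without bound as x grows.
print([f_neglog(10.0 ** k) for k in (1, 3, 6)])

# x log x: grid search over (0, 2], compared with the values 1/e and -1/e.
grid = [i / 10000.0 for i in range(1, 20001)]
x_best = min(grid, key=f_xlogx)
print(x_best, f_xlogx(x_best), 1 / math.e, -1 / math.e)
```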


Feasibility  problems 

If the objective function is identically zero, the optimal value is either zero (if the
feasible set is nonempty) or $\infty$ (if the feasible set is empty). We call this the
feasibility problem, and will sometimes write it as

$$\begin{array}{ll}
\text{find} & x \\
\text{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p.
\end{array}$$

The feasibility problem is thus to determine whether the constraints are consistent,
and if so, find a point that satisfies them.


4.1.2  Expressing  problems  in  standard  form 

We refer to (4.1) as an optimization problem in standard form. In the standard
form problem we adopt the convention that the right-hand sides of the inequality
and equality constraints are zero. This can always be arranged by subtracting any
nonzero right-hand side: we represent the equality constraint $g_i(x) = \tilde g_i(x)$, for
example, as $h_i(x) = 0$, where $h_i(x) = g_i(x) - \tilde g_i(x)$. In a similar way we express
inequalities of the form $f_i(x) \ge 0$ as $-f_i(x) \le 0$.


Example 4.2 Box constraints. Consider the optimization problem

$$\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & l_i \le x_i \le u_i, \quad i = 1, \ldots, n,
\end{array}$$

where $x \in \mathbf{R}^n$ is the variable. The constraints are called variable bounds (since they
give lower and upper bounds for each $x_i$) or box constraints (since the feasible set is
a box).

We can express this problem in standard form as

$$\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & l_i - x_i \le 0, \quad i = 1, \ldots, n \\
& x_i - u_i \le 0, \quad i = 1, \ldots, n.
\end{array}$$

There are $2n$ inequality constraint functions:

$$f_i(x) = l_i - x_i, \quad i = 1, \ldots, n,$$

and

$$f_i(x) = x_{i-n} - u_{i-n}, \quad i = n + 1, \ldots, 2n.$$
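
The construction of the $2n$ constraint functions in example 4.2 can be sketched in a few lines. This is an illustration only; the helper names `box_constraints` and `is_feasible`, and the bounds used, are hypothetical.

```python
def box_constraints(l, u):
    """Return the 2n standard-form constraint functions for l_i <= x_i <= u_i."""
    n = len(l)
    lower = [lambda x, i=i: l[i] - x[i] for i in range(n)]   # f_i, i = 1..n
    upper = [lambda x, i=i: x[i] - u[i] for i in range(n)]   # f_i, i = n+1..2n
    return lower + upper

def is_feasible(x, constraints):
    """A point is feasible when every constraint function is nonpositive."""
    return all(f(x) <= 0 for f in constraints)

fs = box_constraints([0.0, -1.0], [1.0, 1.0])
print(len(fs))                        # number of constraint functions, 2n
print(is_feasible([0.5, 0.0], fs))    # a point inside the box
print(is_feasible([1.5, 0.0], fs))    # a point violating x_1 <= u_1
```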


Maximization  problems 

We  concentrate  on  the  minimization  problem  by  convention.  We  can  solve  the 
maximization  problem 


$$\begin{array}{ll}
\text{maximize} & f_0(x) \\
\text{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p
\end{array} \qquad (4.2)$$




by minimizing the function $-f_0$ subject to the constraints. By this correspondence
we can define all the terms above for the maximization problem (4.2). For example
the optimal value of (4.2) is defined as

$$p^\star = \sup \{ f_0(x) \mid f_i(x) \le 0, \; i = 1, \ldots, m, \; h_i(x) = 0, \; i = 1, \ldots, p \},$$

and a feasible point $x$ is $\epsilon$-suboptimal if $f_0(x) \ge p^\star - \epsilon$. When the maximization
problem is considered, the objective is sometimes called the utility or satisfaction
level instead of the cost.


4.1.3  Equivalent  problems 

In  this  book  we  will  use  the  notion  of  equivalence  of  optimization  problems  in  an 
informal  way.  We  call  two  problems  equivalent  if  from  a  solution  of  one,  a  solution 
of  the  other  is  readily  found,  and  vice  versa.  (It  is  possible,  but  complicated,  to 
give  a  formal  definition  of  equivalence.) 

As  a  simple  example,  consider  the  problem 

$$\begin{array}{ll}
\text{minimize} & \tilde f_0(x) = \alpha_0 f_0(x) \\
\text{subject to} & \tilde f_i(x) = \alpha_i f_i(x) \le 0, \quad i = 1, \ldots, m \\
& \tilde h_i(x) = \beta_i h_i(x) = 0, \quad i = 1, \ldots, p,
\end{array} \qquad (4.3)$$

where $\alpha_i > 0$, $i = 0, \ldots, m$, and $\beta_i \ne 0$, $i = 1, \ldots, p$. This problem is obtained from
the standard form problem (4.1) by scaling the objective and inequality constraint
functions by positive constants, and scaling the equality constraint functions by
nonzero constants. As a result, the feasible sets of the problem (4.3) and the original
problem (4.1) are identical. A point $x$ is optimal for the original problem (4.1) if
and only if it is optimal for the scaled problem (4.3), so we say the two problems are
equivalent. The two problems (4.1) and (4.3) are not, however, the same (unless
$\alpha_i$ and $\beta_i$ are all equal to one), since the objective and constraint functions differ.
We now describe some general transformations that yield equivalent problems.

Change  of  variables 

Suppose $\phi : \mathbf{R}^n \to \mathbf{R}^n$ is one-to-one, with image covering the problem domain $\mathcal{D}$,
i.e., $\phi(\mathbf{dom}\, \phi) \supseteq \mathcal{D}$. We define functions $\tilde f_i$ and $\tilde h_i$ as

$$\tilde f_i(z) = f_i(\phi(z)), \quad i = 0, \ldots, m, \qquad \tilde h_i(z) = h_i(\phi(z)), \quad i = 1, \ldots, p.$$

Now consider the problem

$$\begin{array}{ll}
\text{minimize} & \tilde f_0(z) \\
\text{subject to} & \tilde f_i(z) \le 0, \quad i = 1, \ldots, m \\
& \tilde h_i(z) = 0, \quad i = 1, \ldots, p,
\end{array} \qquad (4.4)$$

with variable $z$. We say that the standard form problem (4.1) and the problem (4.4)
are related by the change of variable or substitution of variable $x = \phi(z)$.

The two problems are clearly equivalent: if $x$ solves the problem (4.1), then
$z = \phi^{-1}(x)$ solves the problem (4.4); if $z$ solves the problem (4.4), then $x = \phi(z)$
solves the problem (4.1).
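
As a small illustration of a change of variable, consider minimizing $f_0(x) = x + 1/x$ over $x > 0$ with the substitution $x = \phi(z) = e^z$. Both the objective and the substitution are our own hypothetical choices, and the grid search stands in for a real solver.

```python
import math

def f0(x):            # original objective, with dom f0 = R_++
    return x + 1.0 / x

def phi(z):           # one-to-one map from R onto R_++
    return math.exp(z)

def f0_tilde(z):      # transformed objective f0(phi(z))
    return f0(phi(z))

# Solve the transformed (unconstrained) problem on a grid, then map back.
z_grid = [i / 1000.0 for i in range(-3000, 3001)]
z_star = min(z_grid, key=f0_tilde)
x_star = phi(z_star)
print(z_star, x_star, f0(x_star))
```

The minimizer of the transformed problem maps back to a minimizer of the original problem through $x = \phi(z)$, and $z$ is recovered from $x$ by $\phi^{-1}(x) = \log x$.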
Transformation of objective and constraint functions

Suppose that $\psi_0 : \mathbf{R} \to \mathbf{R}$ is monotone increasing, $\psi_1, \ldots, \psi_m : \mathbf{R} \to \mathbf{R}$ satisfy
$\psi_i(u) \le 0$ if and only if $u \le 0$, and $\psi_{m+1}, \ldots, \psi_{m+p} : \mathbf{R} \to \mathbf{R}$ satisfy $\psi_i(u) = 0$ if
and only if $u = 0$. We define functions $\tilde f_i$ and $\tilde h_i$ as the compositions

$$\tilde f_i(x) = \psi_i(f_i(x)), \quad i = 0, \ldots, m, \qquad \tilde h_i(x) = \psi_{m+i}(h_i(x)), \quad i = 1, \ldots, p.$$

Evidently the associated problem

$$\begin{array}{ll}
\text{minimize} & \tilde f_0(x) \\
\text{subject to} & \tilde f_i(x) \le 0, \quad i = 1, \ldots, m \\
& \tilde h_i(x) = 0, \quad i = 1, \ldots, p
\end{array}$$

and the standard form problem (4.1) are equivalent; indeed, the feasible sets are
identical, and the optimal points are identical. (The example (4.3) above, in which
the objective and constraint functions are scaled by appropriate constants, is the
special case when all $\psi_i$ are linear.)


Example 4.3 Least-norm and least-norm-squared problems. As a simple example
consider the unconstrained Euclidean norm minimization problem

$$\text{minimize} \quad \|Ax - b\|_2, \qquad (4.5)$$

with variable $x \in \mathbf{R}^n$. Since the norm is always nonnegative, we can just as well solve
the problem

$$\text{minimize} \quad \|Ax - b\|_2^2 = (Ax - b)^T (Ax - b), \qquad (4.6)$$

in which we minimize the square of the Euclidean norm. The problems (4.5) and (4.6)
are clearly equivalent; the optimal points are the same. The two problems are not
the same, however. For example, the objective in (4.5) is not differentiable at any
$x$ with $Ax - b = 0$, whereas the objective in (4.6) is differentiable for all $x$ (in fact,
quadratic).
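
The equivalence of (4.5) and (4.6) can be observed on a tiny instance. The sketch below assumes a single scalar variable, so that $A$ is a column vector `a`; the data are hypothetical and chosen so that the exact minimizer $a^T b / a^T a = 10/9$ lies on the search grid.

```python
import math

a = [1.0, 2.0, 2.0]    # the matrix A, here a single column
b = [2.0, 1.0, 3.0]

def sq_obj(x):         # objective of (4.6): squared Euclidean norm of a x - b
    return sum((ai * x - bi) ** 2 for ai, bi in zip(a, b))

def norm_obj(x):       # objective of (4.5): Euclidean norm of a x - b
    return math.sqrt(sq_obj(x))

grid = [i / 900.0 for i in range(0, 1801)]    # contains 1000/900 = 10/9
x_norm = min(grid, key=norm_obj)
x_sq = min(grid, key=sq_obj)
x_closed = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
print(x_norm, x_sq, x_closed)
```

Both grid searches return the same point, which agrees with the closed-form least-squares solution, even though the two objective values differ.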


Slack  variables 

One simple transformation is based on the observation that $f_i(x) \le 0$ if and only if
there is an $s_i \ge 0$ that satisfies $f_i(x) + s_i = 0$. Using this transformation we obtain
the problem

$$\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & s_i \ge 0, \quad i = 1, \ldots, m \\
& f_i(x) + s_i = 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p,
\end{array} \qquad (4.7)$$

where the variables are $x \in \mathbf{R}^n$ and $s \in \mathbf{R}^m$. This problem has $n + m$ variables,
$m$ inequality constraints (the nonnegativity constraints on $s_i$), and $m + p$ equality
constraints. The new variable $s_i$ is called the slack variable associated with the
original inequality constraint $f_i(x) \le 0$. Introducing slack variables replaces each
inequality constraint with an equality constraint, and a nonnegativity constraint.

The problem (4.7) is equivalent to the original standard form problem (4.1).
Indeed, if $(x, s)$ is feasible for the problem (4.7), then $x$ is feasible for the original
problem, since $s_i = -f_i(x) \ge 0$. Conversely, if $x$ is feasible for the original problem,
then $(x, s)$ is feasible for the problem (4.7), where we take $s_i = -f_i(x)$. Similarly,
$x$ is optimal for the original problem (4.1) if and only if $(x, s)$ is optimal for the
problem (4.7), where $s_i = -f_i(x)$.
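
The correspondence between feasible points of (4.1) and (4.7) can be sketched as follows. The two inequality constraints and all helper names are hypothetical, chosen only to illustrate the choice $s_i = -f_i(x)$.

```python
def to_slack_form(x, ineq):
    """Return the slack vector s with f_i(x) + s_i = 0, i.e. s_i = -f_i(x)."""
    return [-f(x) for f in ineq]

def feasible_original(x, ineq):
    return all(f(x) <= 0 for f in ineq)

def feasible_slack(x, s, ineq, tol=1e-12):
    """Feasibility for the slack form: s >= 0 and f_i(x) + s_i = 0."""
    nonneg = all(si >= 0 for si in s)
    equalities = all(abs(f(x) + si) <= tol for f, si in zip(ineq, s))
    return nonneg and equalities

# Hypothetical constraints: x1 + x2 - 1 <= 0 and -x1 <= 0.
ineq = [lambda x: x[0] + x[1] - 1.0, lambda x: -x[0]]
for x in ([0.25, 0.5], [2.0, 0.0]):
    s = to_slack_form(x, ineq)
    print(x, s, feasible_original(x, ineq), feasible_slack(x, s, ineq))
```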

Eliminating  equality  constraints 

If we can explicitly parametrize all solutions of the equality constraints

$$h_i(x) = 0, \quad i = 1, \ldots, p, \qquad (4.8)$$

using some parameter $z \in \mathbf{R}^k$, then we can eliminate the equality constraints
from the problem, as follows. Suppose the function $\phi : \mathbf{R}^k \to \mathbf{R}^n$ is such that
$x$ satisfies (4.8) if and only if there is some $z \in \mathbf{R}^k$ such that $x = \phi(z)$. The
optimization problem

$$\begin{array}{ll}
\text{minimize} & \tilde f_0(z) = f_0(\phi(z)) \\
\text{subject to} & \tilde f_i(z) = f_i(\phi(z)) \le 0, \quad i = 1, \ldots, m
\end{array}$$

is then equivalent to the original problem (4.1). This transformed problem has
variable $z \in \mathbf{R}^k$, $m$ inequality constraints, and no equality constraints. If $z$ is
optimal for the transformed problem, then $x = \phi(z)$ is optimal for the original
problem. Conversely, if $x$ is optimal for the original problem, then (since $x$ is
feasible) there is at least one $z$ such that $x = \phi(z)$. Any such $z$ is optimal for the
transformed problem.

Eliminating  linear  equality  constraints 

The process of eliminating variables can be described more explicitly, and easily
carried out numerically, when the equality constraints are all linear, i.e., have the
form $Ax = b$. If $Ax = b$ is inconsistent, i.e., $b \notin \mathcal{R}(A)$, then the original problem is
infeasible. Assuming this is not the case, let $x_0$ denote any solution of the equality
constraints. Let $F \in \mathbf{R}^{n \times k}$ be any matrix with $\mathcal{R}(F) = \mathcal{N}(A)$, so the general
solution of the linear equations $Ax = b$ is given by $Fz + x_0$, where $z \in \mathbf{R}^k$. (We
can choose $F$ to be full rank, in which case we have $k = n - \mathbf{rank}\, A$.)

Substituting $x = Fz + x_0$ into the original problem yields the problem

$$\begin{array}{ll}
\text{minimize} & f_0(Fz + x_0) \\
\text{subject to} & f_i(Fz + x_0) \le 0, \quad i = 1, \ldots, m,
\end{array}$$

with variable $z$, which is equivalent to the original problem, has no equality
constraints, and $\mathbf{rank}\, A$ fewer variables.
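
A minimal worked instance of this elimination, under assumptions of our own: we minimize $x_1^2 + x_2^2$ subject to the single linear constraint $x_1 + x_2 = 1$, take $x_0 = (1, 0)$ as a particular solution, and $F = (1, -1)$ as a basis for $\mathcal{N}(A)$. A grid search replaces a proper unconstrained solver.

```python
A = [1.0, 1.0]      # one equality constraint, a^T x = b
b = 1.0
x0 = [1.0, 0.0]     # a particular solution of A x = b
F = [1.0, -1.0]     # single column spanning the nullspace of A

def x_of(z):        # general solution x = F z + x0
    return [F[0] * z + x0[0], F[1] * z + x0[1]]

def f0(x):          # original objective
    return x[0] ** 2 + x[1] ** 2

def f0_reduced(z):  # objective of the reduced problem, unconstrained in z
    return f0(x_of(z))

z_grid = [i / 1000.0 for i in range(-2000, 2001)]
z_star = min(z_grid, key=f0_reduced)
x_star = x_of(z_star)
print(z_star, x_star, A[0] * x_star[0] + A[1] * x_star[1])
```

The reduced problem has one variable instead of two, and every point $x = Fz + x_0$ satisfies the equality constraint by construction.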

Introducing  equality  constraints 

We can also introduce equality constraints and new variables into a problem.
Instead of describing the general case, which is complicated and not very illuminating,
we give a typical example that will be useful later. Consider the problem

$$\begin{array}{ll}
\text{minimize} & f_0(A_0 x + b_0) \\
\text{subject to} & f_i(A_i x + b_i) \le 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p,
\end{array}$$

where $x \in \mathbf{R}^n$, $A_i \in \mathbf{R}^{k_i \times n}$, and $f_i : \mathbf{R}^{k_i} \to \mathbf{R}$. In this problem the objective
and constraint functions are given as compositions of the functions $f_i$ with affine
transformations defined by $A_i x + b_i$.

We introduce new variables $y_i \in \mathbf{R}^{k_i}$, as well as new equality constraints $y_i =
A_i x + b_i$, for $i = 0, \ldots, m$, and form the equivalent problem

$$\begin{array}{ll}
\text{minimize} & f_0(y_0) \\
\text{subject to} & f_i(y_i) \le 0, \quad i = 1, \ldots, m \\
& y_i = A_i x + b_i, \quad i = 0, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p.
\end{array}$$

This problem has $k_0 + \cdots + k_m$ new variables,

$$y_0 \in \mathbf{R}^{k_0}, \quad \ldots, \quad y_m \in \mathbf{R}^{k_m},$$

and $k_0 + \cdots + k_m$ new equality constraints,

$$y_0 = A_0 x + b_0, \quad \ldots, \quad y_m = A_m x + b_m.$$

The objective and inequality constraints in this problem are independent, i.e.,
involve different optimization variables.

Optimizing  over  some  variables 

We always have

$$\inf_{x, y} f(x, y) = \inf_x \tilde f(x)$$

where $\tilde f(x) = \inf_y f(x, y)$. In other words, we can always minimize a function by
first minimizing over some of the variables, and then minimizing over the remaining
ones. This simple and general principle can be used to transform problems into
equivalent forms. The general case is cumbersome to describe and not illuminating,
so we describe instead an example.

Suppose the variable $x \in \mathbf{R}^n$ is partitioned as $x = (x_1, x_2)$, with $x_1 \in \mathbf{R}^{n_1}$,
$x_2 \in \mathbf{R}^{n_2}$, and $n_1 + n_2 = n$. We consider the problem

$$\begin{array}{ll}
\text{minimize} & f_0(x_1, x_2) \\
\text{subject to} & f_i(x_1) \le 0, \quad i = 1, \ldots, m_1 \\
& \tilde f_i(x_2) \le 0, \quad i = 1, \ldots, m_2,
\end{array} \qquad (4.9)$$

in which the constraints are independent, in the sense that each constraint function
depends on $x_1$ or $x_2$. We first minimize over $x_2$. Define the function $\tilde f_0$ of $x_1$ by

$$\tilde f_0(x_1) = \inf \{ f_0(x_1, z) \mid \tilde f_i(z) \le 0, \; i = 1, \ldots, m_2 \}.$$

The problem (4.9) is then equivalent to

$$\begin{array}{ll}
\text{minimize} & \tilde f_0(x_1) \\
\text{subject to} & f_i(x_1) \le 0, \quad i = 1, \ldots, m_1.
\end{array} \qquad (4.10)$$
Example 4.4 Minimizing a quadratic function with constraints on some variables.
Consider a problem with strictly convex quadratic objective, with some of the
variables unconstrained:

$$\begin{array}{ll}
\text{minimize} & x_1^T P_{11} x_1 + 2 x_1^T P_{12} x_2 + x_2^T P_{22} x_2 \\
\text{subject to} & f_i(x_1) \le 0, \quad i = 1, \ldots, m,
\end{array}$$

where $P_{11}$ and $P_{22}$ are symmetric. Here we can analytically minimize over $x_2$:

$$\inf_{x_2} \left( x_1^T P_{11} x_1 + 2 x_1^T P_{12} x_2 + x_2^T P_{22} x_2 \right) = x_1^T \left( P_{11} - P_{12} P_{22}^{-1} P_{12}^T \right) x_1$$

(see §A.5.5). Therefore the original problem is equivalent to

$$\begin{array}{ll}
\text{minimize} & x_1^T \left( P_{11} - P_{12} P_{22}^{-1} P_{12}^T \right) x_1 \\
\text{subject to} & f_i(x_1) \le 0, \quad i = 1, \ldots, m.
\end{array}$$
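
The partial minimization in example 4.4 is easy to verify in the scalar case, where $P_{11}$, $P_{12}$, $P_{22}$ are numbers. The data below are hypothetical, chosen so that the quadratic form is strictly convex; in this case the minimizing $x_2$ is $-P_{12} x_1 / P_{22}$.

```python
p11, p12, p22 = 3.0, 1.0, 2.0    # hypothetical data; p11 * p22 - p12**2 > 0

def full_obj(x1, x2):
    """Quadratic objective in both variables."""
    return p11 * x1 * x1 + 2.0 * p12 * x1 * x2 + p22 * x2 * x2

def reduced_obj(x1):
    """Objective after minimizing over x2 (scalar Schur complement form)."""
    return (p11 - p12 * p12 / p22) * x1 * x1

def argmin_x2(x1):
    """Minimizer over x2 for fixed x1, from setting the x2-derivative to zero."""
    return -p12 * x1 / p22

for x1 in (-2.0, 0.5, 1.0):
    x2 = argmin_x2(x1)
    print(x1, full_obj(x1, x2), reduced_obj(x1))
```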


Epigraph  problem  form 

The epigraph form of the standard problem (4.1) is the problem

$$\begin{array}{ll}
\text{minimize} & t \\
\text{subject to} & f_0(x) - t \le 0 \\
& f_i(x) \le 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p,
\end{array} \qquad (4.11)$$

with variables $x \in \mathbf{R}^n$ and $t \in \mathbf{R}$. We can easily see that it is equivalent to the
original problem: $(x, t)$ is optimal for (4.11) if and only if $x$ is optimal for (4.1)
and $t = f_0(x)$. Note that the objective function of the epigraph form problem is a
linear function of the variables $x$, $t$.

The epigraph form problem (4.11) can be interpreted geometrically as an
optimization problem in the 'graph space' $(x, t)$: we minimize $t$ over the epigraph of
$f_0$, subject to the constraints on $x$. This is illustrated in figure 4.1.

Implicit  and  explicit  constraints 

By a simple trick already mentioned in §3.1.2, we can include any of the constraints
implicitly in the objective function, by redefining its domain. As an extreme
example, the standard form problem can be expressed as the unconstrained problem

$$\text{minimize} \quad F(x), \qquad (4.12)$$

where we define the function $F$ as $f_0$, but with domain restricted to the feasible
set:

$$\mathbf{dom}\, F = \{ x \in \mathbf{dom}\, f_0 \mid f_i(x) \le 0, \; i = 1, \ldots, m, \; h_i(x) = 0, \; i = 1, \ldots, p \},$$

and $F(x) = f_0(x)$ for $x \in \mathbf{dom}\, F$. (Equivalently, we can define $F(x)$ to have value
$\infty$ for $x$ not feasible.) The problems (4.1) and (4.12) are clearly equivalent: they
have the same feasible set, optimal points, and optimal value.

Of course this transformation is nothing more than a notational trick. Making
the constraints implicit has not made the problem any easier to analyze or solve,
even though the problem (4.12) is, at least nominally, unconstrained. In some ways
the transformation makes the problem more difficult. Suppose, for example, that
the objective $f_0$ in the original problem is differentiable, so in particular its domain
is open. The restricted objective function $F$ is probably not differentiable, since
its domain is likely not to be open.

Figure 4.1 Geometric interpretation of epigraph form problem, for a problem
with no constraints. The problem is to find the point in the epigraph
(shown shaded) that minimizes $t$, i.e., the 'lowest' point in the epigraph.
The optimal point is $(x^\star, t^\star)$.

Conversely, we will encounter problems with implicit constraints, which we can
then make explicit. As a simple example, consider the unconstrained problem

$$\text{minimize} \quad f(x) \qquad (4.13)$$

where the function $f$ is given by

$$f(x) = \begin{cases} x^T x & Ax = b \\ \infty & \text{otherwise.} \end{cases}$$

Thus, the objective function is equal to the quadratic form $x^T x$ on the affine set
defined by $Ax = b$, and $\infty$ off the affine set. Since we can clearly restrict our
attention to points that satisfy $Ax = b$, we say that the problem (4.13) has an
implicit equality constraint $Ax = b$ hidden in the objective. We can make the
implicit equality constraint explicit, by forming the equivalent problem

$$\begin{array}{ll}
\text{minimize} & x^T x \\
\text{subject to} & Ax = b.
\end{array} \qquad (4.14)$$

While the problems (4.13) and (4.14) are clearly equivalent, they are not the same.
The problem (4.13) is unconstrained, but its objective function is not differentiable.
The problem (4.14), however, has an equality constraint, but its objective and
constraint functions are differentiable.
4.1.4 Parameter and oracle problem descriptions

For a problem in the standard form (4.1), there is still the question of how the
objective and constraint functions are specified. In many cases these functions
have some analytical or closed form, i.e., are given by a formula or expression that
involves the variable $x$ as well as some parameters. Suppose, for example, the
objective is quadratic, so it has the form $f_0(x) = (1/2) x^T P x + q^T x + r$. To specify
the objective function we give the coefficients (also called problem parameters or
problem data) $P \in \mathbf{S}^n$, $q \in \mathbf{R}^n$, and $r \in \mathbf{R}$. We call this a parameter problem
description, since the specific problem to be solved (i.e., the problem instance) is
specified by giving the values of the parameters that appear in the expressions for
the objective and constraint functions.

In other cases the objective and constraint functions are described by oracle
models (which are also called black box or subroutine models). In an oracle model,
we do not know $f$ explicitly, but can evaluate $f(x)$ (and usually also some
derivatives) at any $x \in \mathbf{dom}\, f$. This is referred to as querying the oracle, and is usually
associated with some cost, such as time. We are also given some prior information
about the function, such as convexity and a bound on its values. As a concrete
example of an oracle model, consider an unconstrained problem, in which we are
to minimize the function $f$. The function value $f(x)$ and its gradient $\nabla f(x)$ are
evaluated in a subroutine. We can call the subroutine at any $x \in \mathbf{dom}\, f$, but do
not have access to its source code. Calling the subroutine with argument $x$ yields
(when the subroutine returns) $f(x)$ and $\nabla f(x)$. Note that in the oracle model,
we never really know the function; we only know the function value (and some
derivatives) at the points where we have queried the oracle. (We also know some
given prior information about the function, such as differentiability and convexity.)

In  practice  the  distinction  between  a  parameter  and  oracle  problem  description 
is  not  so  sharp.  If  we  are  given  a  parameter  problem  description,  we  can  construct 
an  oracle  for  it,  which  simply  evaluates  the  required  functions  and  derivatives  when 
queried.  Most  of  the  algorithms  we  study  in  part  III  work  with  an  oracle  model,  but 
can  be  made  more  efficient  when  they  are  restricted  to  solve  a  specific  parametrized 
family  of  problems. 
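
The relation between the two descriptions can be sketched in code. Below, a parameter description of a scalar quadratic is wrapped in an oracle that returns $f(x)$ and its derivative and counts queries. The class `Oracle`, the fixed-step descent loop, and the data `p`, `q`, `r` are all hypothetical choices made for illustration, not a method prescribed by the text.

```python
class Oracle:
    """Black-box access to a function and its gradient; counts queries."""

    def __init__(self, func, grad):
        self._func = func
        self._grad = grad
        self.queries = 0

    def __call__(self, x):
        self.queries += 1
        return self._func(x), self._grad(x)

def gradient_descent(oracle, x, step, iters):
    """Fixed-step descent that uses only oracle queries, never a formula for f."""
    for _ in range(iters):
        _, g = oracle(x)
        x = x - step * g
    return x

# Parameter description: f(x) = (1/2) p x^2 + q x + r with data (p, q, r).
p, q, r = 2.0, -4.0, 1.0
oracle = Oracle(lambda x: 0.5 * p * x * x + q * x + r,
                lambda x: p * x + q)
x_hat = gradient_descent(oracle, x=0.0, step=0.25, iters=50)
print(x_hat, oracle.queries)
```

The descent loop sees only query results, while the code that builds the oracle knows the parameters. With the parameters in hand, the minimizer $-q/p$ is available directly, which is one way a method restricted to a parametrized family can be more efficient.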


4.2  Convex  optimization 

4.2.1  Convex  optimization  problems  in  standard  form 

A convex optimization problem is one of the form

$$\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& a_i^T x = b_i, \quad i = 1, \ldots, p,
\end{array} \qquad (4.15)$$

where $f_0, \ldots, f_m$ are convex functions. Comparing (4.15) with the general standard
form problem (4.1), the convex problem has three additional requirements:

• the objective function must be convex,

• the inequality constraint functions must be convex,

• the equality constraint functions $h_i(x) = a_i^T x - b_i$ must be affine.

We immediately note an important property: The feasible set of a convex
optimization problem is convex, since it is the intersection of the domain of the problem

$$\mathcal{D} = \bigcap_{i=0}^m \mathbf{dom}\, f_i,$$

which is a convex set, with $m$ (convex) sublevel sets $\{x \mid f_i(x) \le 0\}$ and $p$
hyperplanes $\{x \mid a_i^T x = b_i\}$. (We can assume without loss of generality that $a_i \ne 0$: if
$a_i = 0$ and $b_i = 0$ for some $i$, then the $i$th equality constraint can be deleted; if
$a_i = 0$ and $b_i \ne 0$, the $i$th equality constraint is inconsistent, and the problem is
infeasible.) Thus, in a convex optimization problem, we minimize a convex objective
function over a convex set.

If $f_0$ is quasiconvex instead of convex, we say the problem (4.15) is a (standard
form) quasiconvex optimization problem. Since the sublevel sets of a convex or
quasiconvex function are convex, we conclude that for a convex or quasiconvex
optimization problem the $\epsilon$-suboptimal sets are convex. In particular, the optimal
set is convex. If the objective is strictly convex, then the optimal set contains at
most one point.

Concave  maximization  problems 

With a slight abuse of notation, we will also refer to

$$\begin{array}{ll}
\text{maximize} & f_0(x) \\
\text{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& a_i^T x = b_i, \quad i = 1, \ldots, p,
\end{array} \qquad (4.16)$$

as a convex optimization problem if the objective function $f_0$ is concave, and the
inequality constraint functions $f_1, \ldots, f_m$ are convex. This concave maximization
problem is readily solved by minimizing the convex objective function $-f_0$. All
of the results, conclusions, and algorithms that we describe for the minimization
problem are easily transposed to the maximization case. In a similar way the
maximization problem (4.16) is called quasiconvex if $f_0$ is quasiconcave.

Abstract  form  convex  optimization  problem 

It is important to note a subtlety in our definition of convex optimization problem.
Consider the example with $x \in \mathbf{R}^2$,

$$\begin{array}{ll}
\text{minimize} & f_0(x) = x_1^2 + x_2^2 \\
\text{subject to} & f_1(x) = x_1 / (1 + x_2^2) \le 0 \\
& h_1(x) = (x_1 + x_2)^2 = 0,
\end{array} \qquad (4.17)$$

which is in the standard form (4.1). This problem is not a convex optimization
problem in standard form since the equality constraint function $h_1$ is not affine, and
the inequality constraint function $f_1$ is not convex. Nevertheless the feasible set,
which is $\{x \mid x_1 \le 0, \; x_1 + x_2 = 0\}$, is convex. So although in this problem we are
minimizing a convex function $f_0$ over a convex set, it is not a convex optimization
problem by our definition.

Of course, the problem is readily reformulated as

$$\begin{array}{ll}
\text{minimize} & f_0(x) = x_1^2 + x_2^2 \\
\text{subject to} & \tilde f_1(x) = x_1 \le 0 \\
& \tilde h_1(x) = x_1 + x_2 = 0,
\end{array} \qquad (4.18)$$

which is in standard convex optimization form, since $f_0$ and $\tilde f_1$ are convex, and $\tilde h_1$
is affine.
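
Both claims about (4.17) can be checked numerically on a small grid: its feasible set coincides with that of (4.18), and its inequality constraint function $f_1$ violates the convexity inequality at a midpoint. The grid of multiples of $1/4$ is an assumption made so that all sums are exact in floating point.

```python
def feasible_417(x1, x2):
    """Feasibility for (4.17): nonconvex f1 and non-affine h1."""
    return x1 / (1.0 + x2 * x2) <= 0 and (x1 + x2) ** 2 == 0

def feasible_418(x1, x2):
    """Feasibility for (4.18): x1 <= 0 and x1 + x2 = 0."""
    return x1 <= 0 and x1 + x2 == 0

# Multiples of 1/4 in [-2, 2]; sums and squares of these are exact floats.
grid = [i / 4.0 for i in range(-8, 9)]
same = all(feasible_417(u, v) == feasible_418(u, v) for u in grid for v in grid)
print(same)

def f1(x1, x2):
    return x1 / (1.0 + x2 * x2)

# Convexity would require f1(midpoint) <= average of the endpoint values.
mid_value = f1(-1.0, 1.0)
avg_value = 0.5 * (f1(-1.0, 0.0) + f1(-1.0, 2.0))
print(mid_value, avg_value)
```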

Some  authors  use  the  term  abstract  convex  optimization  problem  to  describe  the 
(abstract)  problem  of  minimizing  a  convex  function  over  a  convex  set.  Using  this 
terminology,  the  problem  (4.17)  is  an  abstract  convex  optimization  problem.  We 
will  not  use  this  terminology  in  this  book.  For  us,  a  convex  optimization  problem  is 
not  just  one  of  minimizing  a  convex  function  over  a  convex  set;  it  is  also  required 
that  the  feasible  set  be  described  specifically  by  a  set  of  inequalities  involving 
convex  functions,  and  a  set  of  linear  equality  constraints.  The  problem  (4.17)  is 
not  a  convex  optimization  problem,  but  the  problem  (4.18)  is  a  convex  optimization 
problem.  (The  two  problems  are,  however,  equivalent.) 

Our  adoption  of  the  stricter  definition  of  convex  optimization  problem  does  not 
matter  much  in  practice.  To  solve  the  abstract  problem  of  minimizing  a  convex 
function  over  a  convex  set,  we  need  to  find  a  description  of  the  set  in  terms  of 
convex  inequalities  and  linear  equality  constraints.  As  the  example  above  suggests, 
this  is  usually  straightforward. 
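
As a small numerical illustration of the distinction between (4.17) and (4.18), the sketch below checks that the inequality constraint function of (4.17) fails a midpoint convexity test while describing the same sublevel set as the affine function in (4.18). The test points and the grid are arbitrary choices made for this illustration, and a finite check of this kind is only a sanity check, not a proof.

```python
def f1_original(x1, x2):
    """Inequality constraint function of (4.17): x1 / (1 + x2^2)."""
    return x1 / (1.0 + x2 ** 2)


def f1_reformulated(x1, x2):
    """Inequality constraint function of (4.18): x1."""
    return x1


def violates_midpoint_convexity(f, a, b):
    """True if f at the midpoint of a and b exceeds the average of f(a) and f(b)."""
    mid = tuple((ai + bi) / 2.0 for ai, bi in zip(a, b))
    return f(*mid) > (f(*a) + f(*b)) / 2.0 + 1e-12


# Two points chosen by hand for the illustration.
a, b = (-1.0, 0.0), (-1.0, 2.0)
print(violates_midpoint_convexity(f1_original, a, b))      # convexity is violated
print(violates_midpoint_convexity(f1_reformulated, a, b))  # affine: no violation

# Both functions have the same 0-sublevel set, because 1 + x2^2 > 0.
grid = [i / 2.0 for i in range(-6, 7)]
same_set = all((f1_original(u, v) <= 0) == (f1_reformulated(u, v) <= 0)
               for u in grid for v in grid)
print(same_set)
```

The first check exhibits one pair of points at which the convexity inequality fails, which is enough to show that the function is not convex; the second and third checks are consistent with the two problems having the same feasible set.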


4.2.2  Local  and  global  optima 

A fundamental property of convex optimization problems is that any locally optimal
point is also (globally) optimal. To see this, suppose that $x$ is locally optimal for
a convex optimization problem, i.e., $x$ is feasible and

\[
f_0(x) = \inf \{ f_0(z) \mid z \text{ feasible},\ \|z - x\|_2 \leq R \},
\tag{4.19}
\]

for some $R > 0$. Now suppose that $x$ is not globally optimal, i.e., there is a feasible
$y$ such that $f_0(y) < f_0(x)$. Evidently $\|y - x\|_2 > R$, since otherwise $f_0(x) \leq f_0(y)$.
Consider the point $z$ given by

\[
z = (1 - \theta)x + \theta y, \qquad \theta = \frac{R}{2\|y - x\|_2}.
\]

Then we have $\|z - x\|_2 = R/2 < R$, and by convexity of the feasible set, $z$ is
feasible. By convexity of $f_0$ we have

\[
f_0(z) \leq (1 - \theta) f_0(x) + \theta f_0(y) < f_0(x),
\]

which contradicts (4.19). Hence there exists no feasible $y$ with $f_0(y) < f_0(x)$, i.e.,
$x$ is globally optimal.

Figure 4.2 Geometric interpretation of the optimality condition (4.21). The
feasible set $X$ is shown shaded. Some level curves of $f_0$ are shown as dashed
lines. The point $x$ is optimal: $-\nabla f_0(x)$ defines a supporting hyperplane
(shown as a solid line) to $X$ at $x$.


It  is  not  true  that  locally  optimal  points  of  quasiconvex  optimization  problems 
are  globally  optimal;  see  §4.2.5. 


4.2.3 An optimality criterion for differentiable $f_0$

Suppose that the objective $f_0$ in a convex optimization problem is differentiable,
so that for all $x, y \in \operatorname{dom} f_0$,

\[
f_0(y) \geq f_0(x) + \nabla f_0(x)^T (y - x)
\tag{4.20}
\]

(see §3.1.3). Let $X$ denote the feasible set, i.e.,

\[
X = \{ x \mid f_i(x) \leq 0,\ i = 1, \ldots, m,\ \ h_i(x) = 0,\ i = 1, \ldots, p \}.
\]

Then $x$ is optimal if and only if $x \in X$ and

\[
\nabla f_0(x)^T (y - x) \geq 0 \ \text{ for all } y \in X.
\tag{4.21}
\]

This optimality criterion can be understood geometrically: If $\nabla f_0(x) \neq 0$, it means
that $-\nabla f_0(x)$ defines a supporting hyperplane to the feasible set at $x$ (see
figure 4.2).

Proof  of  optimality  condition 

First suppose $x \in X$ and satisfies (4.21). Then if $y \in X$ we have, by (4.20),
$f_0(y) \geq f_0(x)$. This shows $x$ is an optimal point for (4.1).

Conversely, suppose $x$ is optimal, but the condition (4.21) does not hold, i.e.,
for some $y \in X$ we have

\[
\nabla f_0(x)^T (y - x) < 0.
\]

Consider the point $z(t) = ty + (1 - t)x$, where $t \in [0, 1]$ is a parameter. Since $z(t)$ is
on the line segment between $x$ and $y$, and the feasible set is convex, $z(t)$ is feasible.
We claim that for small positive $t$, we have $f_0(z(t)) < f_0(x)$, which will prove that
$x$ is not optimal. To show this, note that

\[
\left. \frac{d}{dt} f_0(z(t)) \right|_{t=0} = \nabla f_0(x)^T (y - x) < 0,
\]

so for small positive $t$, we have $f_0(z(t)) < f_0(x)$.

We will pursue the topic of optimality conditions in much more depth in
chapter 5, but here we examine a few simple examples.
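
Before turning to those examples, the condition (4.21) can be probed numerically on a toy instance. The sketch below uses an illustrative problem that is not taken from the text: minimizing $f_0(x) = (x_1 - 2)^2 + (x_2 + 1)^2$ over the box $[0,1]^2$, whose minimizer is the projection $(1, 0)$ of $(2, -1)$ onto the box. The feasible set is sampled on a grid, so this is a finite check under those assumptions rather than a verification of (4.21) for all $y \in X$.

```python
def grad_f0(x):
    """Gradient of the illustrative objective f0(x) = (x1 - 2)^2 + (x2 + 1)^2."""
    return (2.0 * (x[0] - 2.0), 2.0 * (x[1] + 1.0))


def satisfies_4_21(x, feasible_points, tol=1e-12):
    """Check grad f0(x)^T (y - x) >= 0 for every sampled feasible y."""
    g = grad_f0(x)
    return all(g[0] * (y[0] - x[0]) + g[1] * (y[1] - x[1]) >= -tol
               for y in feasible_points)


# Sample the feasible set X = [0, 1]^2 on a grid.
grid = [i / 10.0 for i in range(11)]
X_samples = [(u, v) for u in grid for v in grid]

x_star = (1.0, 0.0)    # projection of (2, -1) onto the box
x_other = (0.5, 0.5)   # feasible, but not optimal

print(satisfies_4_21(x_star, X_samples))    # condition holds at the minimizer
print(satisfies_4_21(x_other, X_samples))   # condition fails at a non-optimal point
```

At the non-optimal point the sampled point $y = (1, 0)$ gives a negative inner product, which is exactly the descent direction used in the converse part of the proof.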

Unconstrained  problems 

For an unconstrained problem (i.e., $m = p = 0$), the condition (4.21) reduces to
the well known necessary and sufficient condition

\[
\nabla f_0(x) = 0
\tag{4.22}
\]

for $x$ to be optimal. While we have already seen this optimality condition, it is
useful to see how it follows from (4.21). Suppose $x$ is optimal, which means here
that $x \in \operatorname{dom} f_0$, and for all feasible $y$ we have $\nabla f_0(x)^T (y - x) \geq 0$. Since $f_0$ is
differentiable, its domain is (by definition) open, so all $y$ sufficiently close to $x$ are
feasible. Let us take $y = x - t \nabla f_0(x)$, where $t \in \mathbf{R}$ is a parameter. For $t$ small and
positive, $y$ is feasible, and so

\[
\nabla f_0(x)^T (y - x) = -t \|\nabla f_0(x)\|_2^2 \geq 0,
\]

from which we conclude $\nabla f_0(x) = 0$.

There are several possible situations, depending on the number of solutions
of (4.22). If there are no solutions of (4.22), then there are no optimal points; the
optimal value of the problem is not attained. Here we can distinguish between
two cases: the problem is unbounded below, or the optimal value is finite, but not
attained. On the other hand we can have multiple solutions of the equation (4.22),
in which case each such solution is a minimizer of $f_0$.


Example 4.5 Unconstrained quadratic optimization. Consider the problem of
minimizing the quadratic function

\[
f_0(x) = (1/2) x^T P x + q^T x + r,
\]

where $P \in \mathbf{S}^n_+$ (which makes $f_0$ convex). The necessary and sufficient condition for
$x$ to be a minimizer of $f_0$ is

\[
\nabla f_0(x) = Px + q = 0.
\]

Several cases can occur, depending on whether this (linear) equation has no solutions,
one solution, or many solutions.

• If $q \notin \mathcal{R}(P)$, then there is no solution. In this case $f_0$ is unbounded below.

• If $P \succ 0$ (which is the condition for $f_0$ to be strictly convex), then there is a
unique minimizer, $x^\star = -P^{-1} q$.

• If $P$ is singular, but $q \in \mathcal{R}(P)$, then the set of optimal points is the (affine) set
$X_{\mathrm{opt}} = -P^\dagger q + \mathcal{N}(P)$, where $P^\dagger$ denotes the pseudo-inverse of $P$ (see §A.5.4).
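
The three cases of example 4.5 can be seen on small diagonal instances. The following sketch uses hand-picked $2 \times 2$ matrices and vectors (assumptions made for this illustration only) and evaluates the gradient $Px + q$ and the objective directly, rather than calling a linear algebra library.

```python
def matvec(P, x):
    """Product P x for a 2x2 matrix stored as nested lists."""
    return (P[0][0] * x[0] + P[0][1] * x[1], P[1][0] * x[0] + P[1][1] * x[1])


def f0(P, q, r, x):
    """Evaluate (1/2) x^T P x + q^T x + r."""
    Px = matvec(P, x)
    return 0.5 * (x[0] * Px[0] + x[1] * Px[1]) + q[0] * x[0] + q[1] * x[1] + r


def gradient(P, q, x):
    """Gradient P x + q of the quadratic."""
    Px = matvec(P, x)
    return (Px[0] + q[0], Px[1] + q[1])


# Case P > 0: unique minimizer x = -P^{-1} q.
P_pd, q_pd = [[2.0, 0.0], [0.0, 1.0]], (-2.0, 1.0)
print(gradient(P_pd, q_pd, (1.0, -1.0)))          # zero gradient at (1, -1)

# Case P singular, q in the range of P: every point (1, s) is optimal.
P_sing, q_in = [[2.0, 0.0], [0.0, 0.0]], (-2.0, 0.0)
print([gradient(P_sing, q_in, (1.0, s)) for s in (-5.0, 0.0, 5.0)])

# Case q not in the range of P: f0 decreases without bound along the nullspace.
q_out = (-2.0, 1.0)
print([f0(P_sing, q_out, 0.0, (1.0, -t)) for t in (1.0, 10.0, 100.0)])
```

In the last case the objective along $x = (1, -t)$ equals $-1 - t$, so no solution of $Px + q = 0$ exists and the values decrease without bound.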


Example 4.6 Analytic centering. Consider the (unconstrained) problem of
minimizing the (convex) function $f_0 : \mathbf{R}^n \to \mathbf{R}$, defined as

\[
f_0(x) = -\sum_{i=1}^m \log(b_i - a_i^T x), \qquad \operatorname{dom} f_0 = \{x \mid Ax \prec b\},
\]

where $a_1^T, \ldots, a_m^T$ are the rows of $A$. The function $f_0$ is differentiable, so the necessary
and sufficient conditions for $x$ to be optimal are

\[
Ax \prec b, \qquad \nabla f_0(x) = \sum_{i=1}^m \frac{1}{b_i - a_i^T x}\, a_i = 0.
\tag{4.23}
\]

(The condition $Ax \prec b$ is just $x \in \operatorname{dom} f_0$.) If $Ax \prec b$ is infeasible, then the domain
of $f_0$ is empty. Assuming $Ax \prec b$ is feasible, there are still several possible cases (see
exercise 4.2):

• There are no solutions of (4.23), and hence no optimal points for the problem.
This occurs if and only if $f_0$ is unbounded below.

• There are many solutions of (4.23). In this case it can be shown that the
solutions form an affine set.

• There is a unique solution of (4.23), i.e., a unique minimizer of $f_0$. This occurs
if and only if the open polyhedron $\{x \mid Ax \prec b\}$ is nonempty and bounded.
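
For a scalar variable the condition (4.23) is a single equation in $x$, and in the bounded case it can be solved by bisection on the derivative. The sketch below assumes a one-dimensional instance with the two inequalities $x < 1$ and $-x < 3$ (chosen for this illustration); the function names are hypothetical and the bisection is simply one convenient root finder, not a method prescribed by the text.

```python
def grad_f0(x, a, b):
    """Derivative of f0(x) = -sum_i log(b_i - a_i x) for scalar x, cf. (4.23)."""
    return sum(ai / (bi - ai * x) for ai, bi in zip(a, b))


def analytic_center_1d(a, b, lo, hi, iters=100):
    """Find the root of grad_f0 on the open interval (lo, hi) by bisection.

    Assumes (lo, hi) = {x | a_i x < b_i for all i} is nonempty and bounded, so
    that f0 is strictly convex there and its derivative is increasing.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad_f0(mid, a, b) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# The inequalities x < 1 and -x < 3 describe the open interval (-3, 1).
a, b = [1.0, -1.0], [1.0, 3.0]
x_ac = analytic_center_1d(a, b, lo=-3.0, hi=1.0)
print(x_ac)
```

Here the stationarity condition reads $1/(1 - x) - 1/(3 + x) = 0$, whose solution is the midpoint $x = -1$ of the interval, in line with the third case above.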


Problems  with  equality  constraints  only 

Consider the case where there are equality constraints but no inequality constraints,
i.e.,

\[
\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & Ax = b.
\end{array}
\]

Here the feasible set is affine. We assume that it is nonempty; otherwise the
problem is infeasible. The optimality condition for a feasible $x$ is that

\[
\nabla f_0(x)^T (y - x) \geq 0
\]

must hold for all $y$ satisfying $Ay = b$. Since $x$ is feasible, every feasible $y$ has the
form $y = x + v$ for some $v \in \mathcal{N}(A)$. The optimality condition can therefore be
expressed as:

\[
\nabla f_0(x)^T v \geq 0 \ \text{ for all } v \in \mathcal{N}(A).
\]

If a linear function is nonnegative on a subspace, then it must be zero on the
subspace, so it follows that $\nabla f_0(x)^T v = 0$ for all $v \in \mathcal{N}(A)$. In other words,

\[
\nabla f_0(x) \perp \mathcal{N}(A).
\]

Using the fact that $\mathcal{N}(A)^\perp = \mathcal{R}(A^T)$, this optimality condition can be expressed
as $\nabla f_0(x) \in \mathcal{R}(A^T)$, i.e., there exists a $\nu \in \mathbf{R}^p$ such that

\[
\nabla f_0(x) + A^T \nu = 0.
\]

Together with the requirement $Ax = b$ (i.e., that $x$ is feasible), this is the classical
Lagrange multiplier optimality condition, which we will study in greater detail in
chapter 5.
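
A minimal instance makes the multiplier condition concrete. The sketch below assumes the illustrative objective $f_0(x) = (1/2)\|x\|_2^2$ and a single equality constraint $a^T x = b$ (so $p = 1$ and $A = a^T$); for this choice $\nabla f_0(x) = x$, and the condition $x + a\nu = 0$ together with $a^T x = b$ can be solved by hand. The data are made up for the example.

```python
def min_norm_point(a, b):
    """Minimize (1/2)||x||_2^2 subject to a^T x = b (one equality constraint)."""
    nrm2 = sum(ai * ai for ai in a)
    nu = -b / nrm2                    # multiplier solving a^T(-a nu) = b
    x = [-nu * ai for ai in a]        # from grad f0(x) + a nu = x + a nu = 0
    return x, nu


a, b = [1.0, 2.0, 2.0], 9.0
x, nu = min_norm_point(a, b)

residual = sum(ai * xi for ai, xi in zip(a, x)) - b        # a^T x - b
stationarity = [xi + ai * nu for ai, xi in zip(a, x)]      # grad f0(x) + A^T nu

# The gradient is orthogonal to a nullspace direction v with a^T v = 0.
v = [2.0, -1.0, 0.0]
grad_dot_v = sum(xi * vi for xi, vi in zip(x, v))
print(x, nu, residual, stationarity, grad_dot_v)
```

With these numbers the solution is $x = (1, 2, 2)$ with $\nu = -1$, and the gradient has zero inner product with the nullspace direction, as the derivation above requires.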

Minimization  over  the  nonnegative  orthant 

As another example we consider the problem

\[
\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & x \succeq 0,
\end{array}
\]

where the only inequality constraints are nonnegativity constraints on the variables.
The optimality condition (4.21) is then

\[
x \succeq 0, \qquad \nabla f_0(x)^T (y - x) \geq 0 \ \text{ for all } y \succeq 0.
\]

The term $\nabla f_0(x)^T y$, which is a linear function of $y$, is unbounded below on $y \succeq 0$,
unless we have $\nabla f_0(x) \succeq 0$. The condition then reduces to $-\nabla f_0(x)^T x \geq 0$. But
$x \succeq 0$ and $\nabla f_0(x) \succeq 0$, so we must have $\nabla f_0(x)^T x = 0$, i.e.,

\[
\sum_{i=1}^n (\nabla f_0(x))_i \, x_i = 0.
\]

Now each of the terms in this sum is the product of two nonnegative numbers, so
we conclude that each term must be zero, i.e., $(\nabla f_0(x))_i \, x_i = 0$ for $i = 1, \ldots, n$.
The optimality condition can therefore be expressed as

\[
x \succeq 0, \qquad \nabla f_0(x) \succeq 0, \qquad x_i \, (\nabla f_0(x))_i = 0, \quad i = 1, \ldots, n.
\]

The last condition is called complementarity, since it means that the sparsity
patterns (i.e., the set of indices corresponding to nonzero components) of the vectors $x$
and $\nabla f_0(x)$ are complementary (i.e., have empty intersection). We will encounter
complementarity conditions again in chapter 5.
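
The complementarity condition is easy to inspect on a separable instance. The sketch below assumes the illustrative objective $f_0(x) = (1/2)\|x - c\|_2^2$ with a made-up vector $c$; for this objective the componentwise positive part of $c$ satisfies the three conditions above, which the code checks directly.

```python
c = [1.5, -2.0, 0.0, 3.0]

# Candidate minimizer of (1/2)||x - c||_2^2 over x >= 0: clip c at zero.
x = [max(ci, 0.0) for ci in c]

# Gradient of the objective at x is x - c.
grad = [xi - ci for xi, ci in zip(x, c)]

primal_ok = all(xi >= 0 for xi in x)                          # x >= 0
grad_ok = all(gi >= 0 for gi in grad)                         # grad f0(x) >= 0
complementary = all(xi * gi == 0 for xi, gi in zip(x, grad))  # x_i * grad_i = 0

print(x, grad, primal_ok, grad_ok, complementary)
```

Note that the third component has both $x_i = 0$ and $(\nabla f_0(x))_i = 0$; complementarity only rules out both being nonzero at the same index.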


4.2.4  Equivalent  convex  problems 

It is useful to see which of the transformations described in §4.1.3 preserve
convexity.

Eliminating  equality  constraints 

For a convex problem the equality constraints must be linear, i.e., of the form
$Ax = b$. In this case they can be eliminated by finding a particular solution $x_0$ of
$Ax = b$, and a matrix $F$ whose range is the nullspace of $A$, which results in the
problem

\[
\begin{array}{ll}
\text{minimize} & f_0(Fz + x_0) \\
\text{subject to} & f_i(Fz + x_0) \leq 0, \quad i = 1, \ldots, m,
\end{array}
\]

with variable $z$. Since the composition of a convex function with an affine
function is convex, eliminating equality constraints preserves convexity of a problem.
Moreover, the process of eliminating equality constraints (and reconstructing the
solution of the original problem from the solution of the transformed problem)
involves standard linear algebra operations.

At least in principle, this means we can restrict our attention to convex
optimization problems which have no equality constraints. In many cases, however, it
is better to retain the equality constraints, since eliminating them can make the
problem harder to understand and analyze, or ruin the efficiency of an algorithm
that solves it. This is true, for example, when the variable $x$ has very large
dimension, and eliminating the equality constraints would destroy sparsity or some other
useful structure of the problem.
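
The elimination step can be traced on a two-variable instance. The sketch below assumes the illustrative problem of minimizing $x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$, with the particular solution $x_0 = (1, 0)$ and $F = (1, -1)^T$ chosen by hand; the reduced problem in the scalar $z$ is then solved from its stationarity condition.

```python
x0 = (1.0, 0.0)    # particular solution of x1 + x2 = 1
F = (1.0, -1.0)    # column spanning the nullspace of A = [1 1]


def x_of(z):
    """Recover x = F z + x0 from the reduced variable z."""
    return (x0[0] + F[0] * z, x0[1] + F[1] * z)


def g(z):
    """Reduced objective f0(F z + x0) with f0(x) = x1^2 + x2^2."""
    x = x_of(z)
    return x[0] ** 2 + x[1] ** 2


# g(z) = (1 + z)^2 + z^2, so g'(z) = 4 z + 2 vanishes at z = -1/2.
z_star = -0.5
x_star = x_of(z_star)
print(z_star, x_star, g(z_star))
```

Every $x$ produced by `x_of` satisfies the equality constraint automatically, so the reduced problem is unconstrained, and the original solution $(1/2, 1/2)$ is reconstructed from $z^\star$ with one affine map.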

Introducing  equality  constraints 

We can introduce new variables and equality constraints into a convex optimization
problem, provided the equality constraints are linear, and the resulting problem
will also be convex. For example, if an objective or constraint function has the form
$f_i(A_i x + b_i)$, where $A_i \in \mathbf{R}^{k_i \times n}$, we can introduce a new variable $y_i \in \mathbf{R}^{k_i}$, replace
$f_i(A_i x + b_i)$ with $f_i(y_i)$, and add the linear equality constraint $y_i = A_i x + b_i$.

Slack  variables 

By introducing slack variables we have the new constraints $f_i(x) + s_i = 0$. Since
equality constraint functions must be affine in a convex problem, we must have $f_i$
affine. In other words: introducing slack variables for linear inequalities preserves
convexity of a problem.

Epigraph  problem  form 

The epigraph form of the convex optimization problem (4.15) is

\[
\begin{array}{ll}
\text{minimize} & t \\
\text{subject to} & f_0(x) - t \leq 0 \\
& f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& a_i^T x = b_i, \quad i = 1, \ldots, p.
\end{array}
\]

The objective is linear (hence convex) and the new constraint function $f_0(x) - t$ is
also convex in $(x, t)$, so the epigraph form problem is convex as well.

It is sometimes said that a linear objective is universal for convex optimization,
since any convex optimization problem is readily transformed to one with linear
objective. The epigraph form of a convex problem has several practical uses. By
assuming the objective of a convex optimization problem is linear, we can simplify
theoretical analysis. It can also simplify algorithm development, since an
algorithm that solves convex optimization problems with linear objective can, using
the transformation above, solve any convex optimization problem (provided it can
handle the constraint $f_0(x) - t \leq 0$).

Minimizing  over  some  variables 

Minimizing a convex function over some variables preserves convexity. Therefore,
if $f_0$ in (4.9) is jointly convex in $x_1$ and $x_2$, and $f_i$, $i = 1, \ldots, m_1$, and $\tilde f_i$,
$i = 1, \ldots, m_2$, are convex, then the equivalent problem (4.10) is convex.


4.2.5  Quasiconvex  optimization 

Recall that a quasiconvex optimization problem has the standard form

\[
\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& Ax = b,
\end{array}
\tag{4.24}
\]

where the inequality constraint functions $f_1, \ldots, f_m$ are convex, and the objective
$f_0$ is quasiconvex (instead of convex, as in a convex optimization problem).
(Quasiconvex constraint functions can be replaced with equivalent convex constraint
functions, i.e., constraint functions that are convex and have the same 0-sublevel
set, as in §3.4.5.)

In this section we point out some basic differences between convex and
quasiconvex optimization problems, and also show how solving a quasiconvex optimization
problem can be reduced to solving a sequence of convex optimization problems.

Locally optimal solutions and optimality conditions

The most important difference between convex and quasiconvex optimization is
that a quasiconvex optimization problem can have locally optimal solutions that
are not (globally) optimal. This phenomenon can be seen even in the simple case
of unconstrained minimization of a quasiconvex function on $\mathbf{R}$, such as the one
shown in figure 4.3.

Nevertheless, a variation of the optimality condition (4.21) given in §4.2.3 does
hold for quasiconvex optimization problems with differentiable objective function.
Let $X$ denote the feasible set for the quasiconvex optimization problem (4.24). It
follows from the first-order condition for quasiconvexity (3.20) that $x$ is optimal if

\[
x \in X, \qquad \nabla f_0(x)^T (y - x) > 0 \ \text{ for all } y \in X \setminus \{x\}.
\tag{4.25}
\]

There are two important differences between this criterion and the analogous
one (4.21) for convex optimization:

• The condition (4.25) is only sufficient for optimality; simple examples show
that it need not hold for an optimal point. In contrast, the condition (4.21)
is necessary and sufficient for $x$ to solve the convex problem.

• The condition (4.25) requires the gradient of $f_0$ to be nonzero, whereas the
condition (4.21) does not. Indeed, when $\nabla f_0(x) = 0$ in the convex case, the
condition (4.21) is satisfied, and $x$ is optimal.

Figure 4.3 A quasiconvex function $f$ on $\mathbf{R}$, with a locally optimal point $x$
that is not globally optimal. This example shows that the simple optimality
condition $f'(x) = 0$, valid for convex functions, does not hold for quasiconvex
functions.


Quasiconvex  optimization  via  convex  feasibility  problems 

One general approach to quasiconvex optimization relies on the representation of
the sublevel sets of a quasiconvex function via a family of convex inequalities, as
described in §3.4.5. Let $\phi_t : \mathbf{R}^n \to \mathbf{R}$, $t \in \mathbf{R}$, be a family of convex functions that
satisfy

\[
f_0(x) \leq t \iff \phi_t(x) \leq 0,
\]

and also, for each $x$, $\phi_t(x)$ is a nonincreasing function of $t$, i.e., $\phi_s(x) \leq \phi_t(x)$
whenever $s \geq t$.

Let $p^\star$ denote the optimal value of the quasiconvex optimization problem (4.24).
If the feasibility problem

\[
\begin{array}{ll}
\text{find} & x \\
\text{subject to} & \phi_t(x) \leq 0 \\
& f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& Ax = b,
\end{array}
\tag{4.26}
\]

is feasible, then we have $p^\star \leq t$. Conversely, if the problem (4.26) is infeasible, then
we can conclude $p^\star \geq t$. The problem (4.26) is a convex feasibility problem, since
the inequality constraint functions are all convex, and the equality constraints
are linear. Thus, we can check whether the optimal value $p^\star$ of a quasiconvex
optimization problem is less than or more than a given value $t$ by solving the
convex feasibility problem (4.26). If the convex feasibility problem is feasible then
we have $p^\star \leq t$, and any feasible point $x$ is feasible for the quasiconvex problem
and satisfies $f_0(x) \leq t$. If the convex feasibility problem is infeasible, then we know
that $p^\star \geq t$.

This observation can be used as the basis of a simple algorithm for solving the
quasiconvex optimization problem (4.24) using bisection, solving a convex
feasibility problem at each step. We assume that the problem is feasible, and start
with an interval $[l, u]$ known to contain the optimal value $p^\star$. We then solve the
convex feasibility problem at its midpoint $t = (l + u)/2$, to determine whether the
optimal value is in the lower or upper half of the interval, and update the interval
accordingly. This produces a new interval, which also contains the optimal value,
but has half the width of the initial interval. This is repeated until the width of
the interval is small enough:


Algorithm  4.1  Bisection  method  for  quasiconvex  optimization. 

given $l \leq p^\star$, $u \geq p^\star$, tolerance $\epsilon > 0$.

repeat

    1. $t := (l + u)/2$.

    2. Solve the convex feasibility problem (4.26).

    3. if (4.26) is feasible, $u := t$; else $l := t$.

until $u - l \leq \epsilon$.


The interval $[l, u]$ is guaranteed to contain $p^\star$, i.e., we have $l \leq p^\star \leq u$ at
each step. In each iteration the interval is divided in two, i.e., bisected, so the
length of the interval after $k$ iterations is $2^{-k}(u - l)$, where $u - l$ is the length of
the initial interval. It follows that exactly $\lceil \log_2((u - l)/\epsilon) \rceil$ iterations are required
before the algorithm terminates. Each step involves solving the convex feasibility
problem (4.26).
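
Algorithm 4.1 can be sketched in a few lines once a feasibility oracle is available. In the sketch below the oracle `is_feasible` is a stand-in for a convex feasibility solver; it is written by hand for a hypothetical one-dimensional instance, minimizing $f_0(x) = (2x + 1)/(x + 3)$ over $0 \leq x \leq 4$, for which $\phi_t(x) = (2x + 1) - t(x + 3)$ is affine in $x$ and nonincreasing in $t$. The function names and the instance are assumptions made for illustration.

```python
import math


def bisect_quasiconvex(is_feasible, l, u, eps):
    """Algorithm 4.1: shrink [l, u] around p* using a feasibility oracle.

    is_feasible(t) must report whether the convex feasibility problem (4.26)
    has a solution for the given t. Requires l <= p* <= u on entry.
    """
    iterations = 0
    while True:                      # repeat ... until u - l <= eps
        t = (l + u) / 2.0
        if is_feasible(t):
            u = t
        else:
            l = t
        iterations += 1
        if u - l <= eps:
            return l, u, iterations


def is_feasible(t):
    """Oracle for the toy instance: is there x in [0, 4] with phi_t(x) <= 0?"""
    def phi(x):
        return (2.0 * x + 1.0) - t * (x + 3.0)
    # An affine function on an interval attains its minimum at an endpoint.
    return min(phi(0.0), phi(4.0)) <= 0.0


l, u, k = bisect_quasiconvex(is_feasible, l=0.0, u=2.0, eps=1e-6)
print(l, u, k, math.ceil(math.log2((2.0 - 0.0) / 1e-6)))
```

For this instance $f_0$ is increasing on $[0, 4]$, so $p^\star = f_0(0) = 1/3$, and the final interval brackets that value. In a realistic setting each oracle call would be a convex feasibility solve, which dominates the cost of the method.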


4.3  Linear  optimization  problems 

When the objective and constraint functions are all affine, the problem is called a
linear program (LP). A general linear program has the form

\[
\begin{array}{ll}
\text{minimize} & c^T x + d \\
\text{subject to} & Gx \preceq h \\
& Ax = b,
\end{array}
\tag{4.27}
\]

where $G \in \mathbf{R}^{m \times n}$ and $A \in \mathbf{R}^{p \times n}$. Linear programs are, of course, convex
optimization problems.

It is common to omit the constant $d$ in the objective function, since it does not
affect the optimal (or feasible) set. Since we can maximize an affine objective
$c^T x + d$, by minimizing $-c^T x - d$ (which is still convex), we also refer to a maximization
problem with affine objective and constraint functions as an LP.

The geometric interpretation of an LP is illustrated in figure 4.4. The feasible
set of the LP (4.27) is a polyhedron $\mathcal{P}$; the problem is to minimize the affine
function $c^T x + d$ (or, equivalently, the linear function $c^T x$) over $\mathcal{P}$.

Figure 4.4 Geometric interpretation of an LP. The feasible set $\mathcal{P}$, which
is a polyhedron, is shaded. The objective $c^T x$ is linear, so its level curves
are hyperplanes orthogonal to $c$ (shown as dashed lines). The point $x^\star$ is
optimal; it is the point in $\mathcal{P}$ as far as possible in the direction $-c$.

Standard and inequality form linear programs

Two special cases of the LP (4.27) are so widely encountered that they have been
given separate names. In a standard form LP the only inequalities are
componentwise nonnegativity constraints $x \succeq 0$:

\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax = b \\
& x \succeq 0.
\end{array}
\tag{4.28}
\]

If the LP has no equality constraints, it is called an inequality form LP, usually
written as

\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax \preceq b.
\end{array}
\tag{4.29}
\]

Converting  LPs  to  standard  form 

It is sometimes useful to transform a general LP (4.27) to one in standard form (4.28)
(for example in order to use an algorithm for standard form LPs). The first step
is to introduce slack variables $s_i$ for the inequalities, which results in

\[
\begin{array}{ll}
\text{minimize} & c^T x + d \\
\text{subject to} & Gx + s = h \\
& Ax = b \\
& s \succeq 0.
\end{array}
\]

The second step is to express the variable $x$ as the difference of two nonnegative
variables $x^+$ and $x^-$, i.e., $x = x^+ - x^-$, $x^+, x^- \succeq 0$. This yields the problem

\[
\begin{array}{ll}
\text{minimize} & c^T x^+ - c^T x^- + d \\
\text{subject to} & Gx^+ - Gx^- + s = h \\
& Ax^+ - Ax^- = b \\
& x^+ \succeq 0, \quad x^- \succeq 0, \quad s \succeq 0,
\end{array}
\]

which is an LP in standard form, with variables $x^+$, $x^-$, and $s$. (For equivalence
of this problem and the original one (4.27), see exercise 4.10.)
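
The two-step conversion can be written out as plain list manipulation. The sketch below assumes a small made-up instance and hypothetical helper names (`to_standard_form`, `lift`); it stacks the variables as $(x^+, x^-, s)$, builds the data of the standard form problem, and checks that a feasible $x$ of (4.27) maps to a feasible point with the same objective value (with the constant $d$ dropped).

```python
def to_standard_form(c, G, h, A, b):
    """Build (c_std, A_std, b_std) for the stacked variable (x_plus, x_minus, s).

    The constant d of (4.27) is dropped, since it does not affect the optimal set.
    """
    m = len(G)
    c_std = list(c) + [-ci for ci in c] + [0.0] * m
    A_std = []
    for i, row in enumerate(G):      # G x+ - G x- + s = h
        slack = [1.0 if j == i else 0.0 for j in range(m)]
        A_std.append(list(row) + [-g for g in row] + slack)
    for row in A:                    # A x+ - A x- = b
        A_std.append(list(row) + [-a for a in row] + [0.0] * m)
    b_std = list(h) + list(b)
    return c_std, A_std, b_std


def lift(x, G, h):
    """Map a feasible x of (4.27) to a feasible (x_plus, x_minus, s)."""
    x_plus = [max(xi, 0.0) for xi in x]
    x_minus = [max(-xi, 0.0) for xi in x]
    s = [hi - sum(g * xi for g, xi in zip(row, x)) for row, hi in zip(G, h)]
    return x_plus + x_minus + s


# Made-up instance with n = 2 variables, m = 2 inequalities, p = 1 equality.
c = [1.0, -1.0]
G, h = [[1.0, 0.0], [0.0, 1.0]], [4.0, 3.0]
A, b = [[1.0, 1.0]], [1.0]
x = [-1.0, 2.0]                      # satisfies G x <= h and A x = b

c_std, A_std, b_std = to_standard_form(c, G, h, A, b)
x_std = lift(x, G, h)
lhs = [sum(a * v for a, v in zip(row, x_std)) for row in A_std]
obj_original = sum(ci * xi for ci, xi in zip(c, x))
obj_standard = sum(ci * vi for ci, vi in zip(c_std, x_std))
print(lhs, b_std, obj_original, obj_standard)
```

The split $x^+ = \max(x, 0)$, $x^- = \max(-x, 0)$ used in `lift` is one valid choice among many, since adding the same nonnegative vector to both parts leaves $x^+ - x^-$ unchanged.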

These  techniques  for  manipulating  problems  (along  with  many  others  we  will 
see  in  the  examples  and  exercises)  can  be  used  to  formulate  many  problems  as  linear 
programs.  With  some  abuse  of  terminology,  it  is  common  to  refer  to  a  problem 
that  can  be  formulated  as  an  LP  as  an  LP,  even  if  it  does  not  have  the  form  (4.27). 


4.3.1  Examples 

LPs  arise  in  a  vast  number  of  fields  and  applications;  here  we  give  a  few  typical 
examples. 

Diet  problem 

A healthy diet contains $m$ different nutrients in quantities at least equal to $b_1, \ldots,
b_m$. We can compose such a diet by choosing nonnegative quantities $x_1, \ldots, x_n$ of
$n$ different foods. One unit quantity of food $j$ contains an amount $a_{ij}$ of nutrient
$i$, and has a cost of $c_j$. We want to determine the cheapest diet that satisfies the
nutritional requirements. This problem can be formulated as the LP

\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax \succeq b \\
& x \succeq 0.
\end{array}
\]

Several  variations  on  this  problem  can  also  be  formulated  as  LPs.  For  example, 
we  can  insist  on  an  exact  amount  of  a  nutrient  in  the  diet  (which  gives  a  linear 
equality  constraint),  or  we  can  impose  an  upper  bound  on  the  amount  of  a  nutrient, 
in  addition  to  the  lower  bound  as  above. 

Chebyshev  center  of  a  polyhedron 

We consider the problem of finding the largest Euclidean ball that lies in a
polyhedron described by linear inequalities,

\[
\mathcal{P} = \{ x \in \mathbf{R}^n \mid a_i^T x \leq b_i,\ i = 1, \ldots, m \}.
\]

(The center of the optimal ball is called the Chebyshev center of the polyhedron;
it is the point deepest inside the polyhedron, i.e., farthest from the boundary;
see §8.5.1.) We represent the ball as

\[
\mathcal{B} = \{ x_c + u \mid \|u\|_2 \leq r \}.
\]

The variables in the problem are the center $x_c \in \mathbf{R}^n$ and the radius $r$; we wish to
maximize $r$ subject to the constraint $\mathcal{B} \subseteq \mathcal{P}$.

We start by considering the simpler constraint that $\mathcal{B}$ lies in one halfspace
$a_i^T x \leq b_i$, i.e.,

\[
\|u\|_2 \leq r \implies a_i^T (x_c + u) \leq b_i.
\tag{4.30}
\]

Since

\[
\sup \{ a_i^T u \mid \|u\|_2 \leq r \} = r \|a_i\|_2
\]

we can write (4.30) as

\[
a_i^T x_c + r \|a_i\|_2 \leq b_i,
\tag{4.31}
\]

which is a linear inequality in $x_c$ and $r$. In other words, the constraint that the
ball lies in the halfspace determined by the inequality $a_i^T x \leq b_i$ can be written as
a linear inequality.

Therefore $\mathcal{B} \subseteq \mathcal{P}$ if and only if (4.31) holds for all $i = 1, \ldots, m$. Hence the
Chebyshev center can be determined by solving the LP

\[
\begin{array}{ll}
\text{maximize} & r \\
\text{subject to} & a_i^T x_c + r \|a_i\|_2 \leq b_i, \quad i = 1, \ldots, m,
\end{array}
\]

with variables $r$ and $x_c$. (For more on the Chebyshev center, see §8.5.1.)
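
For a fixed center, the constraints (4.31) determine the largest admissible radius in closed form, which gives a quick way to evaluate candidate centers. The sketch below does only this inner computation, on a made-up box; it does not solve the LP over $x_c$, and the helper name `max_radius` is hypothetical.

```python
import math


def max_radius(a_rows, b, xc):
    """Largest r with a_i^T xc + r ||a_i||_2 <= b_i for all i, cf. (4.31)."""
    radii = []
    for ai, bi in zip(a_rows, b):
        norm = math.sqrt(sum(v * v for v in ai))
        slack = bi - sum(v * w for v, w in zip(ai, xc))
        radii.append(slack / norm)   # distance from xc to the i-th hyperplane
    return min(radii)


# The box -1 <= x1 <= 3, -1 <= x2 <= 1 written as a_i^T x <= b_i.
a_rows = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [3.0, 1.0, 1.0, 1.0]

r_mid = max_radius(a_rows, b, [1.0, 0.0])
r_off = max_radius(a_rows, b, [0.0, 0.5])
print(r_mid, r_off)
```

The LP above maximizes this quantity jointly over $x_c$ and $r$; for the box in the example the radius cannot exceed 1, the half-width in the $x_2$ direction, and the center $(1, 0)$ attains it.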

Dynamic  activity  planning 

We consider the problem of choosing, or planning, the activity levels of $n$ activities,
or sectors of an economy, over $N$ time periods. We let $x_j(t) \geq 0$, $t = 1, \ldots, N$,
denote the activity level of sector $j$, in period $t$. The activities both consume and
produce products or goods in proportion to their activity levels. The amount of
good $i$ produced per unit of activity $j$ is given by $a_{ij}$. Similarly, the amount of good $i$
consumed per unit of activity $j$ is $b_{ij}$. The total amount of goods produced in period
$t$ is given by $Ax(t) \in \mathbf{R}^m$, and the amount of goods consumed is $Bx(t) \in \mathbf{R}^m$.
(Although we refer to these products as 'goods', they can also include unwanted
products such as pollutants.)

The goods consumed in a period cannot exceed those produced in the previous
period: we must have $Bx(t + 1) \preceq Ax(t)$ for $t = 1, \ldots, N - 1$. A vector $g_0 \in \mathbf{R}^m$ of
initial goods is given, which constrains the first period activity levels: $Bx(1) \preceq g_0$.
The (vectors of) excess goods not consumed by the activities are given by

\[
\begin{array}{rcl}
s(0) & = & g_0 - Bx(1) \\
s(t) & = & Ax(t) - Bx(t + 1), \quad t = 1, \ldots, N - 1 \\
s(N) & = & Ax(N).
\end{array}
\]

The objective is to maximize a discounted total value of excess goods:

\[
c^T s(0) + \gamma c^T s(1) + \cdots + \gamma^N c^T s(N),
\]

where $c \in \mathbf{R}^m$ gives the values of the goods, and $\gamma > 0$ is a discount factor. (The
value $c_i$ is negative if the $i$th product is unwanted, e.g., a pollutant; $|c_i|$ is then the
cost of disposal per unit.)

Putting it all together we arrive at the LP

\[
\begin{array}{ll}
\text{maximize} & c^T s(0) + \gamma c^T s(1) + \cdots + \gamma^N c^T s(N) \\
\text{subject to} & x(t) \succeq 0, \quad t = 1, \ldots, N \\
& s(t) \succeq 0, \quad t = 0, \ldots, N \\
& s(0) = g_0 - Bx(1) \\
& s(t) = Ax(t) - Bx(t + 1), \quad t = 1, \ldots, N - 1 \\
& s(N) = Ax(N),
\end{array}
\]

with variables $x(1), \ldots, x(N)$, $s(0), \ldots, s(N)$. This problem is a standard form LP;
the variables $s(t)$ are the slack variables associated with the constraints
$Bx(t + 1) \preceq Ax(t)$.

Chebyshev inequalities

We consider a probability distribution for a discrete random variable $x$ on a set
$\{u_1, \ldots, u_n\} \subseteq \mathbf{R}$ with $n$ elements. We describe the distribution of $x$ by a vector
$p \in \mathbf{R}^n$, where

\[
p_i = \mathbf{prob}(x = u_i),
\]

so $p$ satisfies $p \succeq 0$ and $\mathbf{1}^T p = 1$. Conversely, if $p$ satisfies $p \succeq 0$ and $\mathbf{1}^T p = 1$, then
it defines a probability distribution for $x$. We assume that $u_i$ are known and fixed,
but the distribution $p$ is not known.

If $f$ is any function of $x$, then

\[
\mathbf{E} f = \sum_{i=1}^n p_i f(u_i)
\]

is a linear function of $p$. If $S$ is any subset of $\mathbf{R}$, then

\[
\mathbf{prob}(x \in S) = \sum_{u_i \in S} p_i
\]

is a linear function of $p$.

Although we do not know $p$, we are given prior knowledge of the following form:
We know upper and lower bounds on expected values of some functions of $x$, and
probabilities of some subsets of $\mathbf{R}$. This prior knowledge can be expressed as linear
inequality constraints on $p$,

\[
\alpha_i \leq a_i^T p \leq \beta_i, \quad i = 1, \ldots, m.
\]

The problem is to give lower and upper bounds on $\mathbf{E} f_0(x) = a_0^T p$, where $f_0$ is some
function of $x$.

To find a lower bound we solve the LP

\[
\begin{array}{ll}
\text{minimize} & a_0^T p \\
\text{subject to} & p \succeq 0, \quad \mathbf{1}^T p = 1 \\
& \alpha_i \leq a_i^T p \leq \beta_i, \quad i = 1, \ldots, m,
\end{array}
\]

with variable $p$. The optimal value of this LP gives the lowest possible value of
$\mathbf{E} f_0(x)$ for any distribution that is consistent with the prior information.
Moreover, the bound is sharp: the optimal solution gives a distribution that is consistent
with the prior information and achieves the lower bound. In a similar way, we can
find the best upper bound by maximizing $a_0^T p$ subject to the same constraints. (We
will consider Chebyshev inequalities in more detail in §7.4.1.)

Piecewise-linear  minimization 

Consider the (unconstrained) problem of minimizing the piecewise-linear, convex
function

\[
f(x) = \max_{i=1,\ldots,m} (a_i^T x + b_i).
\]

This problem can be transformed to an equivalent LP by first forming the epigraph
problem,

\[
\begin{array}{ll}
\text{minimize} & t \\
\text{subject to} & \max_{i=1,\ldots,m} (a_i^T x + b_i) \leq t,
\end{array}
\]

and then expressing the inequality as a set of $m$ separate inequalities:

\[
\begin{array}{ll}
\text{minimize} & t \\
\text{subject to} & a_i^T x + b_i \leq t, \quad i = 1, \ldots, m.
\end{array}
\]

This is an LP (in inequality form), with variables $x$ and $t$.
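
For a scalar variable the epigraph LP has only two unknowns, $x$ and $t$, and its solution can be found by inspecting the points where two of the lines cross. The sketch below assumes a made-up set of three lines, with slopes of both signs so that $f$ is bounded below; enumerating crossings is a brute-force substitute for an LP solver that is adequate only for tiny one-dimensional instances like this one.

```python
from itertools import combinations

# Pairs (a_i, b_i) defining the lines a_i x + b_i, with scalar x.
lines = [(-2.0, 0.0), (1.0, -3.0), (0.5, -4.0)]


def f(x):
    """Piecewise-linear objective max_i (a_i x + b_i)."""
    return max(a * x + b for a, b in lines)


def epigraph_feasible(x, t, tol=1e-12):
    """Constraints of the epigraph LP: a_i x + b_i <= t for every i."""
    return all(a * x + b <= t + tol for a, b in lines)


# Candidate breakpoints: values of x where two lines with distinct slopes cross.
candidates = [(b2 - b1) / (a1 - a2)
              for (a1, b1), (a2, b2) in combinations(lines, 2) if a1 != a2]

x_star = min(candidates, key=f)
t_star = f(x_star)
print(x_star, t_star, epigraph_feasible(x_star, t_star))
```

At the reported point two of the inequalities $a_i x + b_i \leq t$ hold with equality and the third is slack, which is the typical picture at a vertex of the epigraph LP.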


4.3.2  Linear-fractional  programming 

The problem of minimizing a ratio of affine functions over a polyhedron is called a
linear-fractional program:

\[
\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & Gx \preceq h \\
& Ax = b
\end{array}
\tag{4.32}
\]

where the objective function is given by

\[
f_0(x) = \frac{c^T x + d}{e^T x + f}, \qquad \operatorname{dom} f_0 = \{x \mid e^T x + f > 0\}.
\]

The objective function is quasiconvex (in fact, quasilinear) so linear-fractional
programs are quasiconvex optimization problems.


Transforming  to  a  linear  program 


If  the  feasible  set 

{x  |  Gx  A  h,  Ax  =  b,  eTx  +  f  >  0} 

is  nonempty,  the  linear-fractional  program  (4.32)  can  be  transformed  to  an  equiv¬ 
alent  linear  program 

minimize  cTy  +  dz 
subject  to  Gy  —  hz  ^  0 

Ay-bz  =  0  (4.33) 

eTy  +  fz  =  1 

z>0 


with  variables  y,  z. 

To show the equivalence, we first note that if $x$ is feasible in (4.32) then the pair

\[
y = \frac{x}{e^T x + f}, \qquad z = \frac{1}{e^T x + f}
\]

is feasible in (4.33), with the same objective value $c^T y + dz = f_0(x)$. It follows that the optimal value of (4.32) is greater than or equal to the optimal value of (4.33).

Conversely, if $(y, z)$ is feasible in (4.33), with $z \neq 0$, then $x = y/z$ is feasible in (4.32), with the same objective value $f_0(x) = c^T y + dz$. If $(y, z)$ is feasible in (4.33) with $z = 0$, and $x_0$ is feasible for (4.32), then $x = x_0 + ty$ is feasible in (4.32) for all $t \ge 0$. Moreover, $\lim_{t \to \infty} f_0(x_0 + ty) = c^T y + dz$, so we can find feasible points in (4.32) with objective values arbitrarily close to the objective value of $(y, z)$. We conclude that the optimal value of (4.32) is less than or equal to the optimal value of (4.33).
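
The correspondence between feasible points of (4.32) and (4.33) can be checked numerically. The following is a minimal sketch in plain Python; the helper names and the data $c$, $d$, $e$, $f$, $x$ are hypothetical and chosen only for illustration. It maps a point $x$ to the pair $(y, z)$, and confirms that the LP objective equals $f_0(x)$ and that the normalization $e^T y + fz = 1$ holds.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def lfp_objective(c, d, e, f, x):
    """f0(x) = (c^T x + d) / (e^T x + f), defined only where e^T x + f > 0."""
    denom = dot(e, x) + f
    if denom <= 0:
        raise ValueError("x is outside dom f0")
    return (dot(c, x) + d) / denom

def to_lp_point(e, f, x):
    """Map x to (y, z) = (x / (e^T x + f), 1 / (e^T x + f))."""
    denom = dot(e, x) + f
    return [xi / denom for xi in x], 1.0 / denom

def from_lp_point(y, z):
    """Map (y, z) with z > 0 back to x = y / z."""
    return [yi / z for yi in y]

# Hypothetical problem data (illustration only).
c, d = [1.0, 2.0], 0.5
e, f = [0.5, 1.0], 2.0
x = [1.0, 3.0]

y, z = to_lp_point(e, f, x)
lp_value = dot(c, y) + d * z          # objective of (4.33) at (y, z)
lfp_value = lfp_objective(c, d, e, f, x)  # objective of (4.32) at x
x_back = from_lp_point(y, z)
```

The sketch checks only the objective and the normalization constraint; the constraints $Gy - hz \preceq 0$ and $Ay - bz = 0$ follow from $Gx \preceq h$ and $Ax = b$ by multiplying through by $z > 0$.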

Generalized linear-fractional programming

A generalization of the linear-fractional program (4.32) is the generalized linear-fractional program in which

\[
f_0(x) = \max_{i=1,\ldots,r} \frac{c_i^T x + d_i}{e_i^T x + f_i}, \qquad
\mathbf{dom}\, f_0 = \{x \mid e_i^T x + f_i > 0, \; i = 1, \ldots, r\}.
\]

The objective function is the pointwise maximum of $r$ quasiconvex functions, and therefore quasiconvex, so this problem is quasiconvex. When $r = 1$ it reduces to the standard linear-fractional program.


Example 4.7 Von Neumann growth problem. We consider an economy with $n$ sectors, and activity levels $x_i > 0$ in the current period, and activity levels $x_i^+ > 0$ in the next period. (In this problem we only consider one period.) There are $m$ goods which are consumed, and also produced, by the activity: An activity level $x$ consumes goods $Bx \in \mathbf{R}^m$, and produces goods $Ax$. The goods consumed in the next period cannot exceed the goods produced in the current period, i.e., $Bx^+ \preceq Ax$. The growth rate in sector $i$, over the period, is given by $x_i^+/x_i$.

Von Neumann's growth problem is to find an activity level vector $x$ that maximizes the minimum growth rate across all sectors of the economy. This problem can be expressed as a generalized linear-fractional problem

\[
\begin{array}{ll}
\mbox{maximize} & \min_{i=1,\ldots,n} x_i^+/x_i \\
\mbox{subject to} & x^+ \succeq 0 \\
& Bx^+ \preceq Ax
\end{array}
\]

with domain $\{(x, x^+) \mid x \succ 0\}$. Note that this problem is homogeneous in $x$ and $x^+$, so we can replace the implicit constraint $x \succ 0$ by the explicit constraint $x \succeq \mathbf{1}$.


4.4  Quadratic  optimization  problems 

The  convex  optimization  problem  (4.15)  is  called  a  quadratic  program  (QP)  if  the 
objective  function  is  (convex)  quadratic,  and  the  constraint  functions  are  affine.  A 
quadratic  program  can  be  expressed  in  the  form 

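\[
\begin{array}{ll}
\mbox{minimize} & (1/2) x^T P x + q^T x + r \\
\mbox{subject to} & Gx \preceq h \\
& Ax = b,
\end{array}
\tag{4.34}
\]

where $P \in \mathbf{S}^n_+$, $G \in \mathbf{R}^{m \times n}$, and $A \in \mathbf{R}^{p \times n}$. In a quadratic program, we minimize a convex quadratic function over a polyhedron, as illustrated in figure 4.5.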

If the objective in (4.15) as well as the inequality constraint functions are (convex) quadratic, as in

\[
\begin{array}{ll}
\mbox{minimize} & (1/2) x^T P_0 x + q_0^T x + r_0 \\
\mbox{subject to} & (1/2) x^T P_i x + q_i^T x + r_i \le 0, \quad i = 1, \ldots, m \\
& Ax = b,
\end{array}
\tag{4.35}
\]

where $P_i \in \mathbf{S}^n_+$, $i = 0, 1, \ldots, m$, the problem is called a quadratically constrained quadratic program (QCQP). In a QCQP, we minimize a convex quadratic function over a feasible region that is the intersection of ellipsoids (when $P_i \succ 0$).

Figure 4.5 Geometric illustration of QP. The feasible set $\mathcal{P}$, which is a polyhedron, is shown shaded. The contour lines of the objective function, which is convex quadratic, are shown as dashed curves. The point $x^\star$ is optimal.

Quadratic programs include linear programs as a special case, by taking $P = 0$ in (4.34). Quadratically constrained quadratic programs include quadratic programs (and therefore also linear programs) as a special case, by taking $P_i = 0$ in (4.35), for $i = 1, \ldots, m$.


4.4.1  Examples 

Least-squares  and  regression 

The problem of minimizing the convex quadratic function

\[
\|Ax - b\|_2^2 = x^T A^T A x - 2 b^T A x + b^T b
\]

is an (unconstrained) QP. It arises in many fields and has many names, e.g., regression analysis or least-squares approximation. This problem is simple enough to have the well known analytical solution $x = A^\dagger b$, where $A^\dagger$ is the pseudo-inverse of $A$ (see §A.5.4).

When linear inequality constraints are added, the problem is called constrained regression or constrained least-squares, and there is no longer a simple analytical solution. As an example we can consider regression with lower and upper bounds on the variables, i.e.,

\[
\begin{array}{ll}
\mbox{minimize} & \|Ax - b\|_2^2 \\
\mbox{subject to} & l_i \le x_i \le u_i, \quad i = 1, \ldots, n,
\end{array}
\]

which is a QP. (We will study least-squares and regression problems in far more depth in chapters 6 and 7.)

Distance  between  polyhedra 

The (Euclidean) distance between the polyhedra $\mathcal{P}_1 = \{x \mid A_1 x \preceq b_1\}$ and $\mathcal{P}_2 = \{x \mid A_2 x \preceq b_2\}$ in $\mathbf{R}^n$ is defined as

\[
\mathbf{dist}(\mathcal{P}_1, \mathcal{P}_2) = \inf\{\|x_1 - x_2\|_2 \mid x_1 \in \mathcal{P}_1, \; x_2 \in \mathcal{P}_2\}.
\]

If the polyhedra intersect, the distance is zero.

To find the distance between $\mathcal{P}_1$ and $\mathcal{P}_2$, we can solve the QP

\[
\begin{array}{ll}
\mbox{minimize} & \|x_1 - x_2\|_2^2 \\
\mbox{subject to} & A_1 x_1 \preceq b_1, \quad A_2 x_2 \preceq b_2,
\end{array}
\]

with variables $x_1, x_2 \in \mathbf{R}^n$. This problem is infeasible if and only if one of the polyhedra is empty. The optimal value is zero if and only if the polyhedra intersect, in which case the optimal $x_1$ and $x_2$ are equal (and give a point in the intersection $\mathcal{P}_1 \cap \mathcal{P}_2$). Otherwise the optimal $x_1$ and $x_2$ are the points in $\mathcal{P}_1$ and $\mathcal{P}_2$, respectively, that are closest to each other. (We will study geometric problems involving distance in more detail in chapter 8.)

Bounding  variance 

We consider again the Chebyshev inequalities example (page 150), where the variable is an unknown probability distribution given by $p \in \mathbf{R}^n$, about which we have some prior information. The variance of a random variable $f(x)$ is given by

\[
\mathbf{E} f^2 - (\mathbf{E} f)^2 = \sum_{i=1}^n f_i^2 p_i - \left( \sum_{i=1}^n f_i p_i \right)^2
\]

(where $f_i = f(u_i)$), which is a concave quadratic function of $p$.

It follows that we can maximize the variance of $f(x)$, subject to the given prior information, by solving the QP

\[
\begin{array}{ll}
\mbox{maximize} & \sum_{i=1}^n f_i^2 p_i - \left( \sum_{i=1}^n f_i p_i \right)^2 \\
\mbox{subject to} & p \succeq 0, \quad \mathbf{1}^T p = 1 \\
& \alpha_i \le a_i^T p \le \beta_i, \quad i = 1, \ldots, m.
\end{array}
\]

The optimal value gives the maximum possible variance of $f(x)$, over all distributions that are consistent with the prior information; the optimal $p$ gives a distribution that achieves this maximum variance.

Linear  program  with  random  cost 

We consider an LP,

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Gx \preceq h \\
& Ax = b,
\end{array}
\]

with variable $x \in \mathbf{R}^n$. We suppose that the cost function (vector) $c \in \mathbf{R}^n$ is random, with mean value $\bar{c}$ and covariance $\mathbf{E}(c - \bar{c})(c - \bar{c})^T = \Sigma$. (We assume for simplicity that the other problem parameters are deterministic.) For a given $x \in \mathbf{R}^n$, the cost $c^T x$ is a (scalar) random variable with mean $\mathbf{E}\, c^T x = \bar{c}^T x$ and variance

\[
\mathbf{var}(c^T x) = \mathbf{E}(c^T x - \mathbf{E}\, c^T x)^2 = x^T \Sigma x.
\]

In general there is a trade-off between small expected cost and small cost variance. One way to take variance into account is to minimize a linear combination of the expected value and the variance of the cost, i.e.,

\[
\mathbf{E}\, c^T x + \gamma\, \mathbf{var}(c^T x),
\]

which is called the risk-sensitive cost. The parameter $\gamma \ge 0$ is called the risk-aversion parameter, since it sets the relative values of cost variance and expected value. (For $\gamma > 0$, we are willing to trade off an increase in expected cost for a sufficiently large decrease in cost variance.)

To minimize the risk-sensitive cost we solve the QP

\[
\begin{array}{ll}
\mbox{minimize} & \bar{c}^T x + \gamma x^T \Sigma x \\
\mbox{subject to} & Gx \preceq h \\
& Ax = b.
\end{array}
\]
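
To illustrate how the risk-aversion parameter changes the ranking of candidate points, the sketch below evaluates the risk-sensitive cost $\bar{c}^T x + \gamma x^T \Sigma x$ at two points. It is only an evaluation of the objective, not a QP solver, and the data (`c_bar`, `Sigma`, `x_a`, `x_b`) are hypothetical values chosen for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def quad_form(S, x):
    """x^T S x for a square matrix S stored as a list of rows."""
    n = len(x)
    return sum(x[i] * S[i][j] * x[j] for i in range(n) for j in range(n))

def risk_sensitive_cost(c_bar, Sigma, gamma, x):
    """Expected cost plus gamma times cost variance."""
    return dot(c_bar, x) + gamma * quad_form(Sigma, x)

# Hypothetical data: the second coordinate is cheaper on average but noisier.
c_bar = [1.0, 0.8]
Sigma = [[0.04, 0.0],
         [0.0, 0.25]]
x_a = [0.0, 1.0]   # low expected cost (0.8), high variance (0.25)
x_b = [1.0, 0.0]   # higher expected cost (1.0), low variance (0.04)

cost_a_neutral = risk_sensitive_cost(c_bar, Sigma, 0.0, x_a)
cost_b_neutral = risk_sensitive_cost(c_bar, Sigma, 0.0, x_b)
cost_a_averse = risk_sensitive_cost(c_bar, Sigma, 2.0, x_a)
cost_b_averse = risk_sensitive_cost(c_bar, Sigma, 2.0, x_b)
```

With $\gamma = 0$ the point `x_a` has the smaller cost; with $\gamma = 2$ the ordering reverses, since the variance term now outweighs the difference in expected cost.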

Markowitz  portfolio  optimization 

We consider a classical portfolio problem with $n$ assets or stocks held over a period of time. We let $x_i$ denote the amount of asset $i$ held throughout the period, with $x_i$ in dollars, at the price at the beginning of the period. A normal long position in asset $i$ corresponds to $x_i > 0$; a short position in asset $i$ (i.e., the obligation to buy the asset at the end of the period) corresponds to $x_i < 0$. We let $p_i$ denote the relative price change of asset $i$ over the period, i.e., its change in price over the period divided by its price at the beginning of the period. The overall return on the portfolio is $r = p^T x$ (given in dollars). The optimization variable is the portfolio vector $x \in \mathbf{R}^n$.

A wide variety of constraints on the portfolio can be considered. The simplest set of constraints is that $x_i \ge 0$ (i.e., no short positions) and $\mathbf{1}^T x = B$ (i.e., the total budget to be invested is $B$, which is often taken to be one).

We take a stochastic model for price changes: $p \in \mathbf{R}^n$ is a random vector, with known mean $\bar{p}$ and covariance $\Sigma$. Therefore with portfolio $x \in \mathbf{R}^n$, the return $r$ is a (scalar) random variable with mean $\bar{p}^T x$ and variance $x^T \Sigma x$. The choice of portfolio $x$ involves a trade-off between the mean of the return, and its variance.

The classical portfolio optimization problem, introduced by Markowitz, is the QP

\[
\begin{array}{ll}
\mbox{minimize} & x^T \Sigma x \\
\mbox{subject to} & \bar{p}^T x \ge r_{\min} \\
& \mathbf{1}^T x = 1, \quad x \succeq 0,
\end{array}
\]

where $x$, the portfolio, is the variable. Here we find the portfolio that minimizes the return variance (which is associated with the risk of the portfolio) subject to achieving a minimum acceptable mean return $r_{\min}$, and satisfying the portfolio budget and no-shorting constraints.

Many extensions are possible. One standard extension, for example, is to allow short positions, i.e., $x_i < 0$. To do this we introduce variables $x_{\mathrm{long}}$ and $x_{\mathrm{short}}$, with

\[
x_{\mathrm{long}} \succeq 0, \qquad x_{\mathrm{short}} \succeq 0, \qquad
x = x_{\mathrm{long}} - x_{\mathrm{short}}, \qquad
\mathbf{1}^T x_{\mathrm{short}} \le \eta \mathbf{1}^T x_{\mathrm{long}}.
\]

The last constraint limits the total short position at the beginning of the period to some fraction $\eta$ of the total long position at the beginning of the period.

As another extension we can include linear transaction costs in the portfolio optimization problem. Starting from a given initial portfolio $x_{\mathrm{init}}$ we buy and sell assets to achieve the portfolio $x$, which we then hold over the period as described above. We are charged a transaction fee for buying and selling assets, which is proportional to the amount bought or sold. To handle this, we introduce variables $u_{\mathrm{buy}}$ and $u_{\mathrm{sell}}$, which determine the amount of each asset we buy and sell before the holding period. We have the constraints

\[
x = x_{\mathrm{init}} + u_{\mathrm{buy}} - u_{\mathrm{sell}}, \qquad
u_{\mathrm{buy}} \succeq 0, \qquad u_{\mathrm{sell}} \succeq 0.
\]

We replace the simple budget constraint $\mathbf{1}^T x = 1$ with the condition that the initial buying and selling, including transaction fees, involves zero net cash:

\[
(1 - f_{\mathrm{sell}}) \mathbf{1}^T u_{\mathrm{sell}} = (1 + f_{\mathrm{buy}}) \mathbf{1}^T u_{\mathrm{buy}}.
\]

Here the lefthand side is the total proceeds from selling assets, less the selling transaction fee, and the righthand side is the total cost, including transaction fee, of buying assets. The constants $f_{\mathrm{buy}} \ge 0$ and $f_{\mathrm{sell}} \ge 0$ are the transaction fee rates for buying and selling (assumed the same across assets, for simplicity).

The problem of minimizing return variance, subject to a minimum mean return, and the budget and trading constraints, is a QP with variables $x$, $u_{\mathrm{buy}}$, $u_{\mathrm{sell}}$.
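
As a small arithmetic illustration of the zero-net-cash condition, the sketch below rebalances a two-asset portfolio. The function names and all numbers (a 1% fee rate on both sides, the initial holdings, and the trade sizes) are hypothetical.

```python
def net_cash(u_buy, u_sell, f_buy, f_sell):
    """Proceeds from selling (after fees) minus cost of buying (including fees)."""
    return (1 - f_sell) * sum(u_sell) - (1 + f_buy) * sum(u_buy)

def rebalance(x_init, u_buy, u_sell):
    """Return x = x_init + u_buy - u_sell, requiring nonnegative trades."""
    if any(u < 0 for u in u_buy + u_sell):
        raise ValueError("u_buy and u_sell must be elementwise nonnegative")
    return [xi + ub - us for xi, ub, us in zip(x_init, u_buy, u_sell)]

# Hypothetical example: sell 101 dollars of asset 1, buy 99 dollars of asset 2.
f_buy, f_sell = 0.01, 0.01
x_init = [500.0, 500.0]
u_sell = [101.0, 0.0]
u_buy = [0.0, 99.0]

x = rebalance(x_init, u_buy, u_sell)
cash = net_cash(u_buy, u_sell, f_buy, f_sell)
fees_paid = sum(x_init) - sum(x)
```

Here the sale yields $0.99 \cdot 101 = 99.99$ and the purchase costs $1.01 \cdot 99 = 99.99$, so the net cash is zero, while the total holdings drop by 2, which is the sum of the two fees.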


4.4.2  Second-order  cone  programming 

A problem that is closely related to quadratic programming is the second-order cone program (SOCP):

\[
\begin{array}{ll}
\mbox{minimize} & f^T x \\
\mbox{subject to} & \|A_i x + b_i\|_2 \le c_i^T x + d_i, \quad i = 1, \ldots, m \\
& Fx = g,
\end{array}
\tag{4.36}
\]

where $x \in \mathbf{R}^n$ is the optimization variable, $A_i \in \mathbf{R}^{n_i \times n}$, and $F \in \mathbf{R}^{p \times n}$. We call a constraint of the form

\[
\|Ax + b\|_2 \le c^T x + d,
\]

where $A \in \mathbf{R}^{k \times n}$, a second-order cone constraint, since it is the same as requiring the affine function $(Ax + b, c^T x + d)$ to lie in the second-order cone in $\mathbf{R}^{k+1}$.

When $c_i = 0$, $i = 1, \ldots, m$, the SOCP (4.36) is equivalent to a QCQP (which is obtained by squaring each of the constraints). Similarly, if $A_i = 0$, $i = 1, \ldots, m$, then the SOCP (4.36) reduces to a (general) LP. Second-order cone programs are, however, more general than QCQPs (and of course, LPs).

Robust linear programming

We consider a linear program in inequality form,

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & a_i^T x \le b_i, \quad i = 1, \ldots, m,
\end{array}
\]

in which there is some uncertainty or variation in the parameters $c$, $a_i$, $b_i$. To simplify the exposition we assume that $c$ and $b_i$ are fixed, and that $a_i$ are known to lie in given ellipsoids:

\[
a_i \in \mathcal{E}_i = \{\bar{a}_i + P_i u \mid \|u\|_2 \le 1\},
\]

where $P_i \in \mathbf{R}^{n \times n}$. (If $P_i$ is singular we obtain 'flat' ellipsoids, of dimension $\mathbf{rank}\, P_i$; $P_i = 0$ means that $a_i$ is known perfectly.)

We will require that the constraints be satisfied for all possible values of the parameters $a_i$, which leads us to the robust linear program

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & a_i^T x \le b_i \mbox{ for all } a_i \in \mathcal{E}_i, \quad i = 1, \ldots, m.
\end{array}
\tag{4.37}
\]

The robust linear constraint, $a_i^T x \le b_i$ for all $a_i \in \mathcal{E}_i$, can be expressed as

\[
\sup\{a_i^T x \mid a_i \in \mathcal{E}_i\} \le b_i,
\]

the lefthand side of which can be expressed as

\[
\begin{array}{rcl}
\sup\{a_i^T x \mid a_i \in \mathcal{E}_i\} & = & \bar{a}_i^T x + \sup\{u^T P_i^T x \mid \|u\|_2 \le 1\} \\
& = & \bar{a}_i^T x + \|P_i^T x\|_2.
\end{array}
\]

Thus, the robust linear constraint can be expressed as

\[
\bar{a}_i^T x + \|P_i^T x\|_2 \le b_i,
\]

which is evidently a second-order cone constraint. Hence the robust LP (4.37) can be expressed as the SOCP

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & \bar{a}_i^T x + \|P_i^T x\|_2 \le b_i, \quad i = 1, \ldots, m.
\end{array}
\]

Note that the additional norm terms act as regularization terms; they prevent $x$ from being large in directions with considerable uncertainty in the parameters $a_i$.
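
The closed-form expression for the worst-case value of $a_i^T x$ over the ellipsoid can be compared against brute-force sampling. The sketch below does this for a single constraint in $\mathbf{R}^2$; the data `a_bar`, `P`, `x` and the sampling parameters are hypothetical. Sampling gives only a lower bound on the supremum, so it should never exceed the closed form and should approach it as the number of samples grows.

```python
import math
import random

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def norm2(v):
    """Euclidean norm."""
    return math.sqrt(dot(v, v))

def mat_T_vec(P, x):
    """Compute P^T x for P stored as a list of rows."""
    return [sum(P[i][j] * x[i] for i in range(len(x))) for j in range(len(P[0]))]

def worst_case_lhs(a_bar, P, x):
    """sup{a^T x : a = a_bar + P u, ||u||_2 <= 1} = a_bar^T x + ||P^T x||_2."""
    return dot(a_bar, x) + norm2(mat_T_vec(P, x))

def sampled_lhs(a_bar, P, x, trials=2000, seed=0):
    """Largest a^T x over random points a on the boundary of the ellipsoid."""
    rng = random.Random(seed)
    n = len(P[0])
    best = -math.inf
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s = norm2(u)
        u = [ui / s for ui in u]                      # uniform direction, ||u||_2 = 1
        a = [a_bar[i] + dot(P[i], u) for i in range(len(a_bar))]
        best = max(best, dot(a, x))
    return best

# Hypothetical data for one uncertain constraint.
a_bar = [1.0, 2.0]
P = [[0.5, 0.1],
     [0.0, 0.3]]
x = [1.0, -1.0]

exact = worst_case_lhs(a_bar, P, x)
approx = sampled_lhs(a_bar, P, x)
```

The supremum is attained at $u = P_i^T x / \|P_i^T x\|_2$ (when $P_i^T x \neq 0$), which is why the sampled maximum approaches the closed form from below.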

Linear  programming  with  random  constraints 

The robust LP described above can also be considered in a statistical framework. Here we suppose that the parameters $a_i$ are independent Gaussian random vectors, with mean $\bar{a}_i$ and covariance $\Sigma_i$. We require that each constraint $a_i^T x \le b_i$ should hold with a probability (or confidence) exceeding $\eta$, where $\eta \ge 0.5$, i.e.,

\[
\mathbf{prob}(a_i^T x \le b_i) \ge \eta.
\tag{4.38}
\]

We will show that this probability constraint can be expressed as a second-order cone constraint.

Letting $u = a_i^T x$, with $\bar{u}$ denoting its mean and $\sigma^2$ denoting its variance, this constraint can be written as

\[
\mathbf{prob}\left( \frac{u - \bar{u}}{\sigma} \le \frac{b_i - \bar{u}}{\sigma} \right) \ge \eta.
\]

Since $(u - \bar{u})/\sigma$ is a zero mean unit variance Gaussian variable, the probability above is simply $\Phi((b_i - \bar{u})/\sigma)$, where

\[
\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2} \, dt
\]

is the cumulative distribution function of a zero mean unit variance Gaussian random variable. Thus the probability constraint (4.38) can be expressed as

\[
\frac{b_i - \bar{u}}{\sigma} \ge \Phi^{-1}(\eta),
\]

or, equivalently,

\[
\bar{u} + \Phi^{-1}(\eta) \sigma \le b_i.
\]

From $\bar{u} = \bar{a}_i^T x$ and $\sigma = (x^T \Sigma_i x)^{1/2}$ we obtain

\[
\bar{a}_i^T x + \Phi^{-1}(\eta) \|\Sigma_i^{1/2} x\|_2 \le b_i.
\]

By our assumption that $\eta \ge 1/2$, we have $\Phi^{-1}(\eta) \ge 0$, so this constraint is a second-order cone constraint.

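In summary, the problem

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & \mathbf{prob}(a_i^T x \le b_i) \ge \eta, \quad i = 1, \ldots, m
\end{array}
\]

can be expressed as the SOCP

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & \bar{a}_i^T x + \Phi^{-1}(\eta) \|\Sigma_i^{1/2} x\|_2 \le b_i, \quad i = 1, \ldots, m.
\end{array}
\]

(We will consider robust convex optimization problems in more depth in chapter 6. See also exercises 4.13, 4.28, and 4.59.)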
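
The equivalence between the probability constraint and its second-order cone form can be checked for a given $x$ using the Gaussian distribution functions in Python's standard library (`statistics.NormalDist`). This is a sketch for a single constraint; the data `a_bar`, `Sigma`, `b`, `x` are hypothetical, and the code assumes $x^T \Sigma_i x > 0$.

```python
import math
from statistics import NormalDist

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def quad_form(S, x):
    """x^T S x for a square matrix S stored as a list of rows."""
    n = len(x)
    return sum(x[i] * S[i][j] * x[j] for i in range(n) for j in range(n))

def soc_form_holds(a_bar, Sigma, b, eta, x):
    """Check a_bar^T x + Phi^{-1}(eta) * ||Sigma^{1/2} x||_2 <= b."""
    sigma = math.sqrt(quad_form(Sigma, x))   # ||Sigma^{1/2} x||_2 = sqrt(x^T Sigma x)
    return dot(a_bar, x) + NormalDist().inv_cdf(eta) * sigma <= b

def satisfaction_probability(a_bar, Sigma, b, x):
    """prob(a^T x <= b) when a is Gaussian with mean a_bar and covariance Sigma."""
    sigma = math.sqrt(quad_form(Sigma, x))   # assumed positive
    return NormalDist(dot(a_bar, x), sigma).cdf(b)

# Hypothetical data for one random constraint.
a_bar = [1.0, 1.0]
Sigma = [[0.04, 0.0],
         [0.0, 0.09]]
x = [1.0, 2.0]
b = 4.2

prob = satisfaction_probability(a_bar, Sigma, b, x)
ok_95 = soc_form_holds(a_bar, Sigma, b, 0.95, x)
ok_99 = soc_form_holds(a_bar, Sigma, b, 0.99, x)
```

For these numbers the constraint holds with probability of about 0.97, so the cone constraint is satisfied for $\eta = 0.95$ and violated for $\eta = 0.99$, as the derivation predicts.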


Example 4.8 Portfolio optimization with loss risk constraints. We consider again the classical Markowitz portfolio problem described above (page 155). We assume here that the price change vector $p \in \mathbf{R}^n$ is a Gaussian random variable, with mean $\bar{p}$ and covariance $\Sigma$. Therefore the return $r$ is a Gaussian random variable with mean $\bar{r} = \bar{p}^T x$ and variance $\sigma_r^2 = x^T \Sigma x$.

Consider a loss risk constraint of the form

\[
\mathbf{prob}(r \le \alpha) \le \beta,
\tag{4.39}
\]

where $\alpha$ is a given unwanted return level (e.g., a large loss) and $\beta$ is a given maximum probability.

As in the stochastic interpretation of the robust LP given above, we can express this constraint using the cumulative distribution function $\Phi$ of a unit Gaussian random variable. The inequality (4.39) is equivalent to

\[
\bar{p}^T x + \Phi^{-1}(\beta) \|\Sigma^{1/2} x\|_2 \ge \alpha.
\]

Provided $\beta \le 1/2$ (i.e., $\Phi^{-1}(\beta) \le 0$), this loss risk constraint is a second-order cone constraint. (If $\beta > 1/2$, the loss risk constraint becomes nonconvex in $x$.)

The problem of maximizing the expected return subject to a bound on the loss risk (with $\beta \le 1/2$), can therefore be cast as an SOCP with one second-order cone constraint:

\[
\begin{array}{ll}
\mbox{maximize} & \bar{p}^T x \\
\mbox{subject to} & \bar{p}^T x + \Phi^{-1}(\beta) \|\Sigma^{1/2} x\|_2 \ge \alpha \\
& x \succeq 0, \quad \mathbf{1}^T x = 1.
\end{array}
\]

There are many extensions of this problem. For example, we can impose several loss risk constraints, i.e.,

\[
\mathbf{prob}(r \le \alpha_i) \le \beta_i, \quad i = 1, \ldots, k,
\]

(where $\beta_i \le 1/2$), which expresses the risks ($\beta_i$) we are willing to accept for various levels of loss ($\alpha_i$).


Minimal  surface 

Consider a differentiable function $f : \mathbf{R}^2 \to \mathbf{R}$ with $\mathbf{dom}\, f = C$. The surface area of its graph is given by

\[
A = \int_C \sqrt{1 + \|\nabla f(x)\|_2^2} \; dx = \int_C \|(\nabla f(x), 1)\|_2 \; dx,
\]

which is a convex functional of $f$. The minimal surface problem is to find the function $f$ that minimizes $A$ subject to some constraints, for example, some given values of $f$ on the boundary of $C$.

We will approximate this problem by discretizing the function $f$. Let $C = [0, 1] \times [0, 1]$, and let $f_{ij}$ denote the value of $f$ at the point $(i/K, j/K)$, for $i, j = 0, \ldots, K$. An approximate expression for the gradient of $f$ at the point $x = (i/K, j/K)$ can be found using forward differences:

\[
\nabla f(x) \approx K \left[ \begin{array}{c} f_{i+1,j} - f_{i,j} \\ f_{i,j+1} - f_{i,j} \end{array} \right].
\]

Substituting this into the expression for the area of the graph, and approximating the integral as a sum, we obtain an approximation for the area of the graph:

\[
A \approx A_{\mathrm{disc}} = \frac{1}{K^2} \sum_{i,j=0}^{K-1}
\left\| \left[ \begin{array}{c} K(f_{i+1,j} - f_{i,j}) \\ K(f_{i,j+1} - f_{i,j}) \\ 1 \end{array} \right] \right\|_2.
\]

The discretized area approximation $A_{\mathrm{disc}}$ is a convex function of $f_{ij}$.
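
The discretized area is straightforward to evaluate for a given grid of values. The sketch below is a direct transcription of the sum above in plain Python; the function name and the two test surfaces (a flat sheet and a unit-slope ramp) are hypothetical examples whose exact areas over the unit square, $1$ and $\sqrt{2}$, are known.

```python
import math

def discrete_area(f, K):
    """A_disc for grid values f[i][j], i, j = 0..K, using forward differences."""
    total = 0.0
    for i in range(K):
        for j in range(K):
            dx = K * (f[i + 1][j] - f[i][j])   # approximates the partial derivative in the first coordinate
            dy = K * (f[i][j + 1] - f[i][j])   # approximates the partial derivative in the second coordinate
            total += math.sqrt(dx * dx + dy * dy + 1.0)
    return total / K**2

K = 8
flat = [[0.0 for j in range(K + 1)] for i in range(K + 1)]     # f(x) = 0
ramp = [[i / K for j in range(K + 1)] for i in range(K + 1)]   # f(x) = x_1

area_flat = discrete_area(flat, K)
area_ramp = discrete_area(ramp, K)
```

For these two surfaces the gradient is constant, so the forward-difference approximation is exact up to rounding; for curved surfaces the value depends on $K$.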

We can consider a wide variety of constraints on $f_{ij}$, such as equality or inequality constraints on any of its entries (for example, on the boundary values), or on its moments. As an example, we consider the problem of finding the minimal area surface with fixed boundary values on the left and right edges of the square:

\[
\begin{array}{ll}
\mbox{minimize} & A_{\mathrm{disc}} \\
\mbox{subject to} & f_{0j} = l_j, \quad j = 0, \ldots, K \\
& f_{Kj} = r_j, \quad j = 0, \ldots, K
\end{array}
\tag{4.40}
\]

where $f_{ij}$, $i, j = 0, \ldots, K$, are the variables, and $l_j$, $r_j$ are the given boundary values on the left and right sides of the square.

We can transform the problem (4.40) into an SOCP by introducing new variables $t_{ij}$, $i, j = 0, \ldots, K - 1$:

\[
\begin{array}{ll}
\mbox{minimize} & (1/K^2) \sum_{i,j=0}^{K-1} t_{ij} \\
\mbox{subject to} &
\left\| \left[ \begin{array}{c} K(f_{i+1,j} - f_{i,j}) \\ K(f_{i,j+1} - f_{i,j}) \\ 1 \end{array} \right] \right\|_2 \le t_{ij}, \quad i, j = 0, \ldots, K - 1 \\
& f_{0j} = l_j, \quad j = 0, \ldots, K \\
& f_{Kj} = r_j, \quad j = 0, \ldots, K.
\end{array}
\]

4.5  Geometric  programming 

In this section we describe a family of optimization problems that are not convex in their natural form. These problems can, however, be transformed to convex optimization problems, by a change of variables and a transformation of the objective and constraint functions.


4.5.1  Monomials  and  posynomials 

A function $f : \mathbf{R}^n \to \mathbf{R}$ with $\mathbf{dom}\, f = \mathbf{R}^n_{++}$, defined as

\[
f(x) = c x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n},
\tag{4.41}
\]

where $c > 0$ and $a_i \in \mathbf{R}$, is called a monomial function, or simply, a monomial. The exponents $a_i$ of a monomial can be any real numbers, including fractional or negative, but the coefficient $c$ can only be positive. (The term 'monomial' conflicts with the standard definition from algebra, in which the exponents must be nonnegative integers, but this should not cause any confusion.) A sum of monomials, i.e., a function of the form

\[
f(x) = \sum_{k=1}^{K} c_k x_1^{a_{1k}} x_2^{a_{2k}} \cdots x_n^{a_{nk}},
\tag{4.42}
\]

where $c_k > 0$, is called a posynomial function (with $K$ terms), or simply, a posynomial.

Posynomials are closed under addition, multiplication, and nonnegative scaling. Monomials are closed under multiplication and division. If a posynomial is multiplied by a monomial, the result is a posynomial; similarly, a posynomial can be divided by a monomial, with the result a posynomial.


4.5.2  Geometric  programming 

An  optimization  problem  of  the  form 

\[
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \le 1, \quad i = 1, \ldots, m \\
& h_i(x) = 1, \quad i = 1, \ldots, p
\end{array}
\tag{4.43}
\]

where $f_0, \ldots, f_m$ are posynomials and $h_1, \ldots, h_p$ are monomials, is called a geometric program (GP). The domain of this problem is $\mathcal{D} = \mathbf{R}^n_{++}$; the constraint $x \succ 0$ is implicit.

Extensions  of  geometric  programming 

Several extensions are readily handled. If $f$ is a posynomial and $h$ is a monomial, then the constraint $f(x) \le h(x)$ can be handled by expressing it as $f(x)/h(x) \le 1$ (since $f/h$ is posynomial). This includes as a special case a constraint of the form $f(x) \le a$, where $f$ is posynomial and $a > 0$. In a similar way if $h_1$ and $h_2$ are both nonzero monomial functions, then we can handle the equality constraint $h_1(x) = h_2(x)$ by expressing it as $h_1(x)/h_2(x) = 1$ (since $h_1/h_2$ is monomial). We can maximize a nonzero monomial objective function, by minimizing its inverse (which is also a monomial).

For  example,  consider  the  problem 

\[
\begin{array}{ll}
\mbox{maximize} & x/y \\
\mbox{subject to} & 2 \le x \le 3 \\
& x^2 + 3y/z \le \sqrt{y} \\
& x/y = z^2,
\end{array}
\]

with variables $x, y, z \in \mathbf{R}$ (and the implicit constraint $x, y, z > 0$). Using the simple transformations described above, we obtain the equivalent standard form GP

\[
\begin{array}{ll}
\mbox{minimize} & x^{-1} y \\
\mbox{subject to} & 2x^{-1} \le 1, \quad (1/3)x \le 1 \\
& x^2 y^{-1/2} + 3 y^{1/2} z^{-1} \le 1 \\
& x y^{-1} z^{-2} = 1.
\end{array}
\]

We will refer to a problem like this one, that is easily transformed to an equivalent GP in the standard form (4.43), also as a GP. (In the same way that we refer to a problem easily transformed to an LP as an LP.)


4.5.3  Geometric program in convex form

Geometric programs are not (in general) convex optimization problems, but they can be transformed to convex problems by a change of variables and a transformation of the objective and constraint functions.

We will use the variables defined as $y_i = \log x_i$, so $x_i = e^{y_i}$. If $f$ is the monomial function of $x$ given in (4.41), i.e.,

\[
f(x) = c x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n},
\]

then

\[
\begin{array}{rcl}
f(x) & = & f(e^{y_1}, \ldots, e^{y_n}) \\
& = & c (e^{y_1})^{a_1} \cdots (e^{y_n})^{a_n} \\
& = & e^{a^T y + b},
\end{array}
\]

where $b = \log c$. The change of variables $y_i = \log x_i$ turns a monomial function into the exponential of an affine function.

Similarly, if $f$ is the posynomial given by (4.42), i.e.,

\[
f(x) = \sum_{k=1}^{K} c_k x_1^{a_{1k}} x_2^{a_{2k}} \cdots x_n^{a_{nk}},
\]

then

\[
f(x) = \sum_{k=1}^{K} e^{a_k^T y + b_k},
\]

where $a_k = (a_{1k}, \ldots, a_{nk})$ and $b_k = \log c_k$. After the change of variables, a posynomial becomes a sum of exponentials of affine functions.
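
The change of variables can be verified numerically by evaluating a posynomial in both forms. The sketch below uses plain Python; the function names and the two-term posynomial $2 x_1 x_2^{-1/2} + 0.5 x_2^2$ are hypothetical examples.

```python
import math

def posynomial(coeffs, exponents, x):
    """sum_k c_k * prod_i x_i^(a_ik); exponents[k] is the exponent vector a_k."""
    return sum(c * math.prod(xi ** a for xi, a in zip(x, a_k))
               for c, a_k in zip(coeffs, exponents))

def posynomial_log_form(coeffs, exponents, y):
    """sum_k exp(a_k^T y + b_k) with b_k = log c_k, where y_i = log x_i."""
    return sum(math.exp(sum(a * yi for a, yi in zip(a_k, y)) + math.log(c))
               for c, a_k in zip(coeffs, exponents))

# Hypothetical posynomial: 2 * x1 * x2^(-1/2) + 0.5 * x2^2.
coeffs = [2.0, 0.5]
exponents = [[1.0, -0.5],
             [0.0, 2.0]]
x = [1.5, 4.0]
y = [math.log(xi) for xi in x]

value_x = posynomial(coeffs, exponents, x)
value_y = posynomial_log_form(coeffs, exponents, y)
```

Both evaluations give $2 \cdot 1.5 \cdot 4^{-1/2} + 0.5 \cdot 4^2 = 9.5$, up to rounding. Taking the logarithm of `value_y` gives the log-sum-exp function of $y$ that appears in the convex form of the problem.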

The geometric program (4.43) can be expressed in terms of the new variable $y$ as

\[
\begin{array}{ll}
\mbox{minimize} & \sum_{k=1}^{K_0} e^{a_{0k}^T y + b_{0k}} \\
\mbox{subject to} & \sum_{k=1}^{K_i} e^{a_{ik}^T y + b_{ik}} \le 1, \quad i = 1, \ldots, m \\
& e^{g_i^T y + h_i} = 1, \quad i = 1, \ldots, p,
\end{array}
\]

where $a_{ik} \in \mathbf{R}^n$, $i = 0, \ldots, m$, contain the exponents of the posynomial inequality constraints, and $g_i \in \mathbf{R}^n$, $i = 1, \ldots, p$, contain the exponents of the monomial equality constraints of the original geometric program.

Now we transform the objective and constraint functions, by taking the logarithm. This results in the problem

\[
\begin{array}{ll}
\mbox{minimize} & \tilde{f}_0(y) = \log \left( \sum_{k=1}^{K_0} e^{a_{0k}^T y + b_{0k}} \right) \\
\mbox{subject to} & \tilde{f}_i(y) = \log \left( \sum_{k=1}^{K_i} e^{a_{ik}^T y + b_{ik}} \right) \le 0, \quad i = 1, \ldots, m \\
& \tilde{h}_i(y) = g_i^T y + h_i = 0, \quad i = 1, \ldots, p.
\end{array}
\tag{4.44}
\]

Since the functions $\tilde{f}_i$ are convex, and $\tilde{h}_i$ are affine, this problem is a convex optimization problem. We refer to it as a geometric program in convex form. To distinguish it from the original geometric program, we refer to (4.43) as a geometric program in posynomial form.

Note that the transformation between the posynomial form geometric program (4.43) and the convex form geometric program (4.44) does not involve any computation; the problem data for the two problems are the same. It simply changes the form of the objective and constraint functions.

If  the  posynomial  objective  and  constraint  functions  all  have  only  one  term, 
i.e.,  are  monomials,  then  the  convex  form  geometric  program  (4.44)  reduces  to  a 
(general)  linear  program.  We  can  therefore  consider  geometric  programming  to  be 
a  generalization,  or  extension,  of  linear  programming. 


4.5.4  Examples 

Frobenius  norm  diagonal  scaling 

Consider a matrix $M \in \mathbf{R}^{n \times n}$, and the associated linear function that maps $u$ into $y = Mu$. Suppose we scale the coordinates, i.e., change variables to $\tilde{u} = Du$, $\tilde{y} = Dy$, where $D$ is diagonal, with $D_{ii} > 0$. In the new coordinates the linear function is given by $\tilde{y} = DMD^{-1} \tilde{u}$.

Now suppose we want to choose the scaling in such a way that the resulting matrix, $DMD^{-1}$, is small. We will use the Frobenius norm (squared) to measure the size of the matrix:

\[
\begin{array}{rcl}
\|DMD^{-1}\|_F^2 & = & \mathbf{tr} \left( (DMD^{-1})^T (DMD^{-1}) \right) \\
& = & \sum_{i,j=1}^{n} \left( DMD^{-1} \right)_{ij}^2 \\
& = & \sum_{i,j=1}^{n} M_{ij}^2 d_i^2 / d_j^2,
\end{array}
\]

where $D = \mathbf{diag}(d)$. Since this is a posynomial in $d$, the problem of choosing the scaling $d$ to minimize the Frobenius norm is an unconstrained geometric program,

\[
\mbox{minimize} \quad \sum_{i,j=1}^{n} M_{ij}^2 d_i^2 / d_j^2,
\]

with variable $d$. The only exponents in this geometric program are 0, 2, and $-2$.
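
The posynomial expression for the scaled Frobenius norm can be compared with a direct computation from the entries of $DMD^{-1}$. The sketch below evaluates both for a badly scaled $2 \times 2$ matrix; the matrix `M` and the scalings are hypothetical, and the code only evaluates the objective for given $d$, it does not solve the geometric program.

```python
def scaled_frobenius_sq(M, d):
    """Posynomial form: sum_{i,j} M_ij^2 * d_i^2 / d_j^2."""
    n = len(M)
    return sum(M[i][j] ** 2 * d[i] ** 2 / d[j] ** 2
               for i in range(n) for j in range(n))

def scaled_frobenius_sq_direct(M, d):
    """Form the entries of D M D^{-1} and sum their squares."""
    n = len(M)
    scaled = [[d[i] * M[i][j] / d[j] for j in range(n)] for i in range(n)]
    return sum(v * v for row in scaled for v in row)

# Hypothetical badly scaled matrix.
M = [[1.0, 100.0],
     [0.01, 1.0]]

unscaled = scaled_frobenius_sq(M, [1.0, 1.0])     # no scaling, D = I
balanced = scaled_frobenius_sq(M, [1.0, 100.0])   # scaling that equalizes the off-diagonal entries
direct = scaled_frobenius_sq_direct(M, [1.0, 100.0])
```

For this matrix the squared norm drops from about $10002$ with $D = I$ to $4$ with $d = (1, 100)$. Note that the objective depends only on the ratios $d_i/d_j$, so $d$ is determined only up to a positive scale factor.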


Design  of  a  cantilever  beam 

We consider the design of a cantilever beam, which consists of $N$ segments, numbered from right to left as $1, \ldots, N$, as shown in figure 4.6. Each segment has unit length and a uniform rectangular cross-section with width $w_i$ and height $h_i$. A vertical load (force) $F$ is applied at the right end of the beam. This load causes the beam to deflect (downward), and induces stress in each segment of the beam. We assume that the deflections are small, and that the material is linearly elastic, with Young's modulus $E$.

Figure 4.6 Segmented cantilever beam with 4 segments. Each segment has unit length and a rectangular profile. A vertical force $F$ is applied at the right end of the beam.


The design variables in the problem are the widths $w_i$ and heights $h_i$ of the $N$ segments. We seek to minimize the total volume of the beam (which is proportional to its weight),

\[
w_1 h_1 + \cdots + w_N h_N,
\]

subject to some design constraints. We impose upper and lower bounds on width and height of the segments,

\[
w_{\min} \le w_i \le w_{\max}, \qquad h_{\min} \le h_i \le h_{\max}, \qquad i = 1, \ldots, N,
\]

as well as the aspect ratios,

\[
S_{\min} \le h_i / w_i \le S_{\max}.
\]

In addition, we have a limit on the maximum allowable stress in the material, and on the vertical deflection at the end of the beam.

We first consider the maximum stress constraint. The maximum stress in segment $i$, which we denote $\sigma_i$, is given by $\sigma_i = 6iF/(w_i h_i^2)$. We impose the constraints

\[
\frac{6iF}{w_i h_i^2} \le \sigma_{\max}, \qquad i = 1, \ldots, N,
\]

to ensure that the stress does not exceed the maximum allowable value $\sigma_{\max}$ anywhere in the beam.

The last constraint is a limit on the vertical deflection at the end of the beam, which we will denote $y_1$:

\[
y_1 \le y_{\max}.
\]

The deflection $y_1$ can be found by a recursion that involves the deflection and slope of the beam segments:

\[
v_i = 12(i - 1/2) \frac{F}{E w_i h_i^3} + v_{i+1}, \qquad
y_i = 6(i - 1/3) \frac{F}{E w_i h_i^3} + v_{i+1} + y_{i+1},
\tag{4.45}
\]

for $i = N, N - 1, \ldots, 1$, with starting values $v_{N+1} = y_{N+1} = 0$. In this recursion, $y_i$ is the deflection at the right end of segment $i$, and $v_i$ is the slope at that point. We can use the recursion (4.45) to show that these deflection and slope quantities are in fact posynomial functions of the variables $w$ and $h$. We first note that $v_{N+1}$ and $y_{N+1}$ are zero, and therefore posynomials. Now assume that $v_{i+1}$ and $y_{i+1}$ are posynomial functions of $w$ and $h$. The lefthand equation in (4.45) shows that $v_i$ is the sum of a monomial and a posynomial (i.e., $v_{i+1}$), and therefore is a posynomial. From the righthand equation in (4.45), we see that the deflection $y_i$ is the sum of a monomial and two posynomials ($v_{i+1}$ and $y_{i+1}$), and so is a posynomial. In particular, the deflection at the end of the beam, $y_1$, is a posynomial.
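
The recursion (4.45) is easy to run numerically for given segment dimensions. The sketch below is a direct transcription in plain Python; the function name, the 1-based storage layout, and the unit test data are hypothetical. For a uniform beam the result can be compared with the standard tip-deflection formula $FL^3/(3EI)$ with $I = wh^3/12$, which is quoted here only as a sanity check and is not part of the source text.

```python
def beam_deflection(w, h, F, E):
    """Run recursion (4.45) for i = N, ..., 1.

    w[i-1], h[i-1] are the width and height of segment i.
    Returns lists v, y of length N + 2, indexed so that v[i], y[i]
    are the slope and deflection at the right end of segment i,
    with v[N+1] = y[N+1] = 0.
    """
    N = len(w)
    v = [0.0] * (N + 2)
    y = [0.0] * (N + 2)
    for i in range(N, 0, -1):
        k = F / (E * w[i - 1] * h[i - 1] ** 3)
        v[i] = 12 * (i - 0.5) * k + v[i + 1]
        y[i] = 6 * (i - 1.0 / 3.0) * k + v[i + 1] + y[i + 1]
    return v, y

# Hypothetical data: uniform beams with unit width, height, load, and modulus.
v1, y1 = beam_deflection([1.0], [1.0], 1.0, 1.0)            # N = 1
v2, y2 = beam_deflection([1.0, 1.0], [1.0, 1.0], 1.0, 1.0)  # N = 2

tip_one_segment = y1[1]    # compare with F * 1^3 / (3 E I) = 4
tip_two_segments = y2[1]   # compare with F * 2^3 / (3 E I) = 32
```

Every term added in the loop has a positive coefficient and depends on $w_i$ and $h_i$ only through $w_i^{-1} h_i^{-3}$, which is the numerical counterpart of the posynomial argument given above.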

The problem is then

\[
\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^{N} w_i h_i \\
\mbox{subject to} & w_{\min} \le w_i \le w_{\max}, \quad i = 1, \ldots, N \\
& h_{\min} \le h_i \le h_{\max}, \quad i = 1, \ldots, N \\
& S_{\min} \le h_i / w_i \le S_{\max}, \quad i = 1, \ldots, N \\
& 6iF/(w_i h_i^2) \le \sigma_{\max}, \quad i = 1, \ldots, N \\
& y_1 \le y_{\max},
\end{array}
\tag{4.46}
\]

with variables $w$ and $h$. This is a GP, since the objective is a posynomial, and the constraints can all be expressed as posynomial inequalities. (In fact, the constraints can all be expressed as monomial inequalities, with the exception of the deflection limit, which is a complicated posynomial inequality.)

When the number of segments $N$ is large, the number of monomial terms appearing in the posynomial $y_1$ grows approximately as $N^2$. Another formulation of this problem, explored in exercise 4.31, is obtained by introducing $v_1, \ldots, v_N$ and $y_1, \ldots, y_N$ as variables, and including a modified version of the recursion as a set of constraints. This formulation avoids this growth in the number of monomial terms.

Minimizing  spectral  radius  via  Perron-Frobenius  theory 

Suppose the matrix $A \in \mathbf{R}^{n \times n}$ is elementwise nonnegative, i.e., $A_{ij} \geq 0$ for
$i, j = 1, \ldots, n$, and irreducible, which means that the matrix $(I + A)^{n-1}$ is elementwise
positive. The Perron-Frobenius theorem states that $A$ has a positive real eigenvalue
$\lambda_{\rm pf}$ equal to its spectral radius, i.e., the largest magnitude of its eigenvalues. The
Perron-Frobenius eigenvalue $\lambda_{\rm pf}$ determines the asymptotic rate of growth or decay
of $A^k$, as $k \to \infty$; in fact, the matrix $((1/\lambda_{\rm pf})A)^k$ converges. Roughly speaking,
this means that as $k \to \infty$, $A^k$ grows like $\lambda_{\rm pf}^k$, if $\lambda_{\rm pf} > 1$, or decays like
$\lambda_{\rm pf}^k$, if $\lambda_{\rm pf} < 1$.
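
The Perron-Frobenius eigenvalue is easy to estimate numerically. The following is a minimal sketch, not taken from the text: it runs power iteration on $I + A$ (which is primitive whenever $A$ is irreducible, so the iteration does not oscillate) and subtracts 1 at the end. The function name and the iteration limits are illustrative choices.

```python
def perron_frobenius_eigenvalue(A, iters=2000, tol=1e-13):
    """Estimate lambda_pf of an elementwise nonnegative, irreducible matrix A.

    Power iteration is applied to B = I + A, whose dominant eigenvalue is
    1 + lambda_pf; the shift avoids oscillation for periodic matrices.
    """
    n = len(A)
    v = [1.0] * n          # positive starting vector, scaled so max(v) == 1
    lam = 0.0
    for _ in range(iters):
        # w = (I + A) v
        w = [v[i] + sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        new_lam = max(w)   # growth factor, because max(v) == 1
        v = [wi / new_lam for wi in w]
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam - 1.0


if __name__ == "__main__":
    # Eigenvalues of this matrix are 3 and -2, so lambda_pf = 3.
    print(perron_frobenius_eigenvalue([[1.0, 2.0], [3.0, 0.0]]))
```

The second test matrix below, a 2-cycle, is irreducible but periodic; plain power iteration on $A$ itself would oscillate there, which is why the sketch shifts by the identity.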

A basic result in the theory of nonnegative matrices states that the Perron-Frobenius
eigenvalue is given by

\[
\lambda_{\rm pf} = \inf \{ \lambda \mid Av \preceq \lambda v \mbox{ for some } v \succ 0 \}
\]

(and moreover, that the infimum is achieved). The inequality $Av \preceq \lambda v$ can be
expressed as

\[
\sum_{j=1}^n A_{ij} v_j / (\lambda v_i) \leq 1, \quad i = 1, \ldots, n, \qquad (4.47)
\]

which is a set of posynomial inequalities in the variables $A_{ij}$, $v_i$, and $\lambda$. Thus,
the condition that $\lambda_{\rm pf} \leq \lambda$ can be expressed as a set of posynomial inequalities
in $A$, $v$, and $\lambda$. This allows us to solve some optimization problems involving the
Perron-Frobenius eigenvalue using geometric programming.
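
One way to read the characterization above: every positive vector $v$ certifies an upper bound on the Perron-Frobenius eigenvalue, namely the smallest $\lambda$ for which $Av \preceq \lambda v$ holds. The sketch below (the function name is an illustrative choice, not from the text) computes that bound for a given $v$.

```python
def pf_upper_bound(A, v):
    """Smallest lambda with A v <= lambda v elementwise, for a given v > 0.

    By the inf-characterization, this value is an upper bound on lambda_pf,
    and it equals lambda_pf when v is a Perron-Frobenius eigenvector.
    """
    n = len(A)
    if any(vi <= 0 for vi in v):
        raise ValueError("v must be elementwise positive")
    return max(sum(A[i][j] * v[j] for j in range(n)) / v[i] for i in range(n))


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 0.0]]           # eigenvalues 3 and -2
    print(pf_upper_bound(A, [1.0, 1.0]))   # (1, 1) is the eigenvector for 3
    print(pf_upper_bound(A, [1.0, 2.0]))   # a looser bound
```

Minimizing this bound over $v \succ 0$ recovers the eigenvalue itself, which is exactly what the constraints (4.47) encode.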

Suppose that the entries of the matrix $A$ are posynomial functions of some
underlying variable $x \in \mathbf{R}^k$. In this case the inequalities (4.47) are posynomial
inequalities in the variables $x \in \mathbf{R}^k$, $v \in \mathbf{R}^n$, and $\lambda \in \mathbf{R}$. We consider the problem
of choosing $x$ to minimize the Perron-Frobenius eigenvalue (or spectral radius) of
$A$, possibly subject to posynomial inequalities on $x$,

\[
\begin{array}{ll}
\mbox{minimize} & \lambda_{\rm pf}(A(x)) \\
\mbox{subject to} & f_i(x) \leq 1, \quad i = 1, \ldots, p,
\end{array}
\]

where $f_i$ are posynomials. Using the characterization above, we can express this
problem as the GP

\[
\begin{array}{ll}
\mbox{minimize} & \lambda \\
\mbox{subject to} & \sum_{j=1}^n A_{ij} v_j / (\lambda v_i) \leq 1, \quad i = 1, \ldots, n \\
& f_i(x) \leq 1, \quad i = 1, \ldots, p,
\end{array}
\]

where the variables are $x$, $v$, and $\lambda$.

As a specific example, we consider a simple model for the population dynamics
for a bacterium, with time or period denoted by $t = 0, 1, 2, \ldots$, in hours. The vector
$p(t) \in \mathbf{R}^4_+$ characterizes the population age distribution at period $t$: $p_1(t)$ is the
total population between 0 and 1 hours old; $p_2(t)$ is the total population between
1 and 2 hours old; and so on. We (arbitrarily) assume that no bacteria live more
than 4 hours. The population propagates in time as $p(t + 1) = Ap(t)$, where

\[
A = \left[ \begin{array}{cccc}
b_1 & b_2 & b_3 & b_4 \\
s_1 & 0 & 0 & 0 \\
0 & s_2 & 0 & 0 \\
0 & 0 & s_3 & 0
\end{array} \right].
\]

Here $b_i$ is the birth rate among bacteria in age group $i$, and $s_i$ is the survival rate
from age group $i$ into age group $i + 1$. We assume that $b_i > 0$ and $0 < s_i < 1$,
which implies that the matrix $A$ is irreducible.

The Perron-Frobenius eigenvalue of $A$ determines the asymptotic growth or
decay rate of the population. If $\lambda_{\rm pf} < 1$, the population converges to zero like
$\lambda_{\rm pf}^t$, and so has a half-life of $-1/\log_2 \lambda_{\rm pf}$ hours. If $\lambda_{\rm pf} > 1$ the population grows
geometrically like $\lambda_{\rm pf}^t$, with a doubling time of $1/\log_2 \lambda_{\rm pf}$ hours. Minimizing the
spectral radius of $A$ corresponds to finding the fastest decay rate, or slowest growth
rate, for the population.
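
As a small worked illustration of the half-life and doubling-time formulas, the following sketch converts a Perron-Frobenius eigenvalue into a time in hours. The function name and return convention are illustrative assumptions.

```python
import math


def half_life_or_doubling_time(lam_pf):
    """Convert lambda_pf into a time scale in hours (one period = one hour).

    Returns ("half-life", -1/log2(lam_pf)) when lam_pf < 1 and
    ("doubling time", 1/log2(lam_pf)) when lam_pf > 1.
    """
    if lam_pf <= 0 or lam_pf == 1:
        raise ValueError("lam_pf must be positive and different from 1")
    hours = 1.0 / abs(math.log2(lam_pf))
    label = "half-life" if lam_pf < 1 else "doubling time"
    return label, hours


if __name__ == "__main__":
    print(half_life_or_doubling_time(0.5))   # population halves every hour
    print(half_life_or_doubling_time(4.0))   # population doubles every half hour
```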

As our underlying variables, on which the matrix $A$ depends, we take $c_1$ and $c_2$,
the concentrations of two chemicals in the environment that affect the birth and
survival rates of the bacteria. We model the birth and survival rates as monomial
functions of the two concentrations:

\[
\begin{array}{ll}
b_i = b_i^{\rm nom} (c_1/c_1^{\rm nom})^{\alpha_i} (c_2/c_2^{\rm nom})^{\beta_i}, & i = 1, \ldots, 4 \\
s_i = s_i^{\rm nom} (c_1/c_1^{\rm nom})^{\gamma_i} (c_2/c_2^{\rm nom})^{\delta_i}, & i = 1, \ldots, 3.
\end{array}
\]

Here, $b_i^{\rm nom}$ is nominal birth rate, $s_i^{\rm nom}$ is nominal survival rate, and $c_i^{\rm nom}$ is nominal
concentration of chemical $i$. The constants $\alpha_i$, $\beta_i$, $\gamma_i$, and $\delta_i$ give the effect on the
birth and survival rates due to changes in the concentrations of the chemicals away
from the nominal values. For example $\alpha_2 = -0.3$ and $\gamma_1 = 0.5$ means that an
increase in concentration of chemical 1, over the nominal concentration, causes a
decrease in the birth rate of bacteria that are between 1 and 2 hours old, and an
increase in the survival rate of bacteria from 0 to 1 hours old.

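We assume that the concentrations $c_1$ and $c_2$ can be independently increased or
decreased (say, within a factor of 2), by administering drugs, and pose the problem
of finding the drug mix that maximizes the population decay rate (i.e., minimizes
$\lambda_{\rm pf}(A)$). Using the approach described above, this problem can be posed as the
GP

\[
\begin{array}{ll}
\mbox{minimize} & \lambda \\
\mbox{subject to} & b_1 v_1 + b_2 v_2 + b_3 v_3 + b_4 v_4 \leq \lambda v_1 \\
& s_1 v_1 \leq \lambda v_2 \\
& s_2 v_2 \leq \lambda v_3 \\
& s_3 v_3 \leq \lambda v_4 \\
& 1/2 \leq c_i / c_i^{\rm nom} \leq 2, \quad i = 1, 2 \\
& b_i = b_i^{\rm nom} (c_1/c_1^{\rm nom})^{\alpha_i} (c_2/c_2^{\rm nom})^{\beta_i}, \quad i = 1, \ldots, 4 \\
& s_i = s_i^{\rm nom} (c_1/c_1^{\rm nom})^{\gamma_i} (c_2/c_2^{\rm nom})^{\delta_i}, \quad i = 1, \ldots, 3,
\end{array}
\]

with variables $b_i$, $s_i$, $c_i$, $v_i$, and $\lambda$.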
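
For intuition, the drug-mix problem can also be explored by brute force. The sketch below does not solve the GP; it simply evaluates $\lambda_{\rm pf}$ on a logarithmic grid of concentration ratios $c_i/c_i^{\rm nom} \in [1/2, 2]$ and keeps the best point. All numerical data (nominal rates and exponents) are hypothetical values chosen for illustration, except that $\alpha_2 = -0.3$ and $\gamma_1 = 0.5$ echo the example in the text.

```python
def pf_eigenvalue(A, iters=300):
    """Power iteration on I + A; returns an estimate of lambda_pf(A)."""
    n = len(A)
    v = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        w = [v[i] + sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)
        v = [wi / lam for wi in w]
    return lam - 1.0


def monomial_rates(c1, c2, nominal, exponents):
    """nominal[i] * c1**e1 * c2**e2, with c1, c2 measured in units of c_nom."""
    return [r * c1 ** e1 * c2 ** e2 for r, (e1, e2) in zip(nominal, exponents)]


# Hypothetical problem data (assumed for this sketch, not from the text).
B_NOM = [0.2, 1.0, 0.8, 0.3]
S_NOM = [0.5, 0.4, 0.3]
B_EXP = [(0.1, -0.2), (-0.3, 0.2), (0.2, 0.1), (-0.1, 0.3)]   # (alpha_i, beta_i)
S_EXP = [(0.5, -0.1), (-0.2, 0.3), (0.1, -0.4)]               # (gamma_i, delta_i)


def growth_rate(c1, c2):
    """lambda_pf of the 4x4 population matrix at concentration ratios c1, c2."""
    b = monomial_rates(c1, c2, B_NOM, B_EXP)
    s = monomial_rates(c1, c2, S_NOM, S_EXP)
    A = [b, [s[0], 0, 0, 0], [0, s[1], 0, 0], [0, 0, s[2], 0]]
    return pf_eigenvalue(A)


def grid_search(m=11):
    """Return (best lambda_pf, c1, c2) over an m x m logarithmic grid on [1/2, 2]."""
    grid = [2.0 ** (-1.0 + 2.0 * k / (m - 1)) for k in range(m)]
    return min((growth_rate(c1, c2), c1, c2) for c1 in grid for c2 in grid)


if __name__ == "__main__":
    lam, c1, c2 = grid_search()
    print(f"best lambda_pf on grid: {lam:.4f} at c1/c1nom={c1:.3f}, c2/c2nom={c2:.3f}")
```

A grid search scales poorly with the number of underlying variables and gives no optimality certificate, which is the practical argument for the GP formulation.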


4.6  Generalized  inequality  constraints 

One very useful generalization of the standard form convex optimization problem
(4.15) is obtained by allowing the inequality constraint functions to be vector
valued, and using generalized inequalities in the constraints:

\[
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \preceq_{K_i} 0, \quad i = 1, \ldots, m \\
& Ax = b,
\end{array}
\qquad (4.48)
\]

where $f_0 : \mathbf{R}^n \to \mathbf{R}$, $K_i \subseteq \mathbf{R}^{k_i}$ are proper cones, and $f_i : \mathbf{R}^n \to \mathbf{R}^{k_i}$ are $K_i$-convex.
We refer to this problem as a (standard form) convex optimization problem with
generalized inequality constraints. Problem (4.15) is a special case with $K_i = \mathbf{R}_+$,
$i = 1, \ldots, m$.

Many  of  the  results  for  ordinary  convex  optimization  problems  hold  for  problems 
with  generalized  inequalities.  Some  examples  are: 

•  The  feasible  set,  any  sublevel  set,  and  the  optimal  set  are  convex. 

•  Any  point  that  is  locally  optimal  for  the  problem  (4.48)  is  globally  optimal. 

•  The optimality condition for differentiable $f_0$, given in §4.2.3, holds without
any change.

We  will  also  see  (in  chapter  11)  that  convex  optimization  problems  with  generalized 
inequality  constraints  can  often  be  solved  as  easily  as  ordinary  convex  optimization 
problems. 

4.6.1  Conic  form  problems

Among the simplest convex optimization problems with generalized inequalities are
the conic form problems (or cone programs), which have a linear objective and one
inequality constraint function, which is affine (and therefore $K$-convex):

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Fx + g \preceq_K 0 \\
& Ax = b.
\end{array}
\qquad (4.49)
\]

When  K  is  the  nonnegative  orthant,  the  conic  form  problem  reduces  to  a  linear 
program.  We  can  view  conic  form  problems  as  a  generalization  of  linear  programs 
in  which  componentwise  inequality  is  replaced  with  a  generalized  linear  inequality. 

Continuing the analogy to linear programming, we refer to the conic form problem

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & x \succeq_K 0 \\
& Ax = b
\end{array}
\]

as a conic form problem in standard form. Similarly, the problem

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Fx + g \preceq_K 0
\end{array}
\]

is called a conic form problem in inequality form.


4.6.2  Semidefinite  programming 

When $K$ is $\mathbf{S}^k_+$, the cone of positive semidefinite $k \times k$ matrices, the associated
conic form problem is called a semidefinite program (SDP), and has the form

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & x_1 F_1 + \cdots + x_n F_n + G \preceq 0 \\
& Ax = b,
\end{array}
\qquad (4.50)
\]

where $G, F_1, \ldots, F_n \in \mathbf{S}^k$, and $A \in \mathbf{R}^{p \times n}$. The inequality here is a linear matrix
inequality (see example 2.10).

If the matrices $G, F_1, \ldots, F_n$ are all diagonal, then the LMI in (4.50) is equivalent
to a set of $k$ linear inequalities (one for each diagonal entry), and the SDP (4.50)
reduces to a linear program.

Standard  and  inequality  form  semidefinite  programs 

Following the analogy to LP, a standard form SDP has linear equality constraints,
and a (matrix) nonnegativity constraint on the variable $X \in \mathbf{S}^n$:

\[
\begin{array}{ll}
\mbox{minimize} & \mathbf{tr}(CX) \\
\mbox{subject to} & \mathbf{tr}(A_i X) = b_i, \quad i = 1, \ldots, p \\
& X \succeq 0,
\end{array}
\qquad (4.51)
\]

where $C, A_1, \ldots, A_p \in \mathbf{S}^n$. (Recall that $\mathbf{tr}(CX) = \sum_{i,j=1}^n C_{ij} X_{ij}$ is the form of a
general real-valued linear function on $\mathbf{S}^n$.) This form should be compared to the
standard form linear program (4.28). In LP and SDP standard forms, we minimize
a linear function of the variable, subject to $p$ linear equality constraints on the
variable, and a nonnegativity constraint on the variable.

An inequality form SDP, analogous to an inequality form LP (4.29), has no
equality constraints, and one LMI:

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & x_1 A_1 + \cdots + x_n A_n \preceq B,
\end{array}
\]

with variable $x \in \mathbf{R}^n$, and parameters $B, A_1, \ldots, A_n \in \mathbf{S}^k$, $c \in \mathbf{R}^n$.

Multiple  LMIs  and  linear  inequalities 

It is common to refer to a problem with linear objective, linear equality and
inequality constraints, and several LMI constraints, i.e.,

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & F^{(i)}(x) = x_1 F_1^{(i)} + \cdots + x_n F_n^{(i)} + G^{(i)} \preceq 0, \quad i = 1, \ldots, K \\
& Gx \preceq h, \quad Ax = b,
\end{array}
\]

as an SDP as well. Such problems are readily transformed to an SDP, by forming
a large block diagonal LMI from the individual LMIs and linear inequalities:

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & \mathbf{diag}(Gx - h, F^{(1)}(x), \ldots, F^{(K)}(x)) \preceq 0 \\
& Ax = b.
\end{array}
\]


4.6.3  Examples 


Second-order  cone  programming 

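The SOCP (4.36) can be expressed as a conic form problem

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & -(A_i x + b_i, \; c_i^T x + d_i) \preceq_{K_i} 0, \quad i = 1, \ldots, m \\
& Fx = g,
\end{array}
\]

in which

\[
K_i = \{ (y, t) \in \mathbf{R}^{n_i + 1} \mid \|y\|_2 \leq t \},
\]

i.e., the second-order cone in $\mathbf{R}^{n_i + 1}$. This explains the name second-order cone
program for the optimization problem (4.36).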
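
The conic constraint is easy to check pointwise. The sketch below tests membership in the second-order cone and evaluates a single SOCP constraint $\|Ax + b\|_2 \leq c^T x + d$ at a given $x$; the function names and the list-of-lists matrix representation are illustrative assumptions.

```python
import math


def in_second_order_cone(y, t):
    """True if (y, t) lies in the second-order cone, i.e. ||y||_2 <= t."""
    return math.sqrt(sum(yi * yi for yi in y)) <= t


def socp_constraint_holds(A, b, c, d, x):
    """Check ||A x + b||_2 <= c^T x + d, i.e. (A x + b, c^T x + d) is in the cone."""
    y = [sum(aij * xj for aij, xj in zip(row, x)) + bi for row, bi in zip(A, b)]
    t = sum(ci * xi for ci, xi in zip(c, x)) + d
    return in_second_order_cone(y, t)


if __name__ == "__main__":
    print(in_second_order_cone([3.0, 4.0], 5.0))   # on the boundary of the cone
    print(in_second_order_cone([3.0, 4.0], 4.9))   # outside the cone
```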


Matrix  norm  minimization 

Let $A(x) = A_0 + x_1 A_1 + \cdots + x_n A_n$, where $A_i \in \mathbf{R}^{p \times q}$. We consider the
unconstrained problem

\[
\mbox{minimize} \quad \|A(x)\|_2,
\]

where $\| \cdot \|_2$ denotes the spectral norm (maximum singular value), and $x \in \mathbf{R}^n$ is
the variable. This is a convex problem since $\|A(x)\|_2$ is a convex function of $x$.

Using the fact that $\|A\|_2 \leq s$ if and only if $A^T A \preceq s^2 I$ (and $s \geq 0$), we can
express the problem in the form

\[
\begin{array}{ll}
\mbox{minimize} & s \\
\mbox{subject to} & A(x)^T A(x) \preceq sI,
\end{array}
\]

with variables $x$ and $s$. (In this problem $s$ plays the role of a bound on the squared
norm, which does not change the optimal $x$.) Since the function $A(x)^T A(x) - sI$ is
matrix convex in $(x, s)$, this is a convex optimization problem with a single $q \times q$
matrix inequality constraint.

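We can also formulate the problem using a single linear matrix inequality of
size $(p + q) \times (p + q)$, using the fact that

\[
A^T A \preceq t^2 I \mbox{ (and } t \geq 0) \iff
\left[ \begin{array}{cc} tI & A \\ A^T & tI \end{array} \right] \succeq 0
\]

(see §A.5.5). This results in the SDP

\[
\begin{array}{ll}
\mbox{minimize} & t \\
\mbox{subject to} & \left[ \begin{array}{cc} tI & A(x) \\ A(x)^T & tI \end{array} \right] \succeq 0
\end{array}
\]

in the variables $x$ and $t$.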
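
The equivalence behind this LMI can be sketched with a Schur complement argument. This is a brief outline under the convention that the upper left block $tI$ is $p \times p$ and the lower right block is $q \times q$; the full statement is the one referenced in §A.5.5.

```latex
% Case t > 0: the upper left block tI_p is positive definite, so the block
% matrix is positive semidefinite iff its Schur complement is.
\[
\left[ \begin{array}{cc} tI_p & A \\ A^T & tI_q \end{array} \right] \succeq 0
\iff tI_q - A^T (tI_p)^{-1} A \succeq 0
\iff A^T A \preceq t^2 I_q .
\]
% Case t = 0: a positive semidefinite matrix with zero diagonal blocks must
% have zero off-diagonal blocks, so the LMI holds iff A = 0, which is the
% same as A^T A \preceq 0.
```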

Moment  problems 

Let $t$ be a random variable in $\mathbf{R}$. The expected values $\mathbf{E}\, t^k$ (assuming they exist)
are called the (power) moments of the distribution of $t$. The following classical
results give a characterization of a moment sequence.

If there is a probability distribution on $\mathbf{R}$ such that $x_k = \mathbf{E}\, t^k$, $k = 0, \ldots, 2n$,
then $x_0 = 1$ and

\[
H(x_0, \ldots, x_{2n}) =
\left[ \begin{array}{cccccc}
x_0 & x_1 & x_2 & \cdots & x_{n-1} & x_n \\
x_1 & x_2 & x_3 & \cdots & x_n & x_{n+1} \\
x_2 & x_3 & x_4 & \cdots & x_{n+1} & x_{n+2} \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
x_{n-1} & x_n & x_{n+1} & \cdots & x_{2n-2} & x_{2n-1} \\
x_n & x_{n+1} & x_{n+2} & \cdots & x_{2n-1} & x_{2n}
\end{array} \right] \succeq 0.
\qquad (4.52)
\]

(The matrix $H$ is called the Hankel matrix associated with $x_0, \ldots, x_{2n}$.) This is
easy to see: Let $x_i = \mathbf{E}\, t^i$, $i = 0, \ldots, 2n$ be the moments of some distribution, and
let $y = (y_0, y_1, \ldots, y_n) \in \mathbf{R}^{n+1}$. Then we have

\[
y^T H(x_0, \ldots, x_{2n}) y = \sum_{i,j=0}^n y_i y_j \mathbf{E}\, t^{i+j}
= \mathbf{E} (y_0 + y_1 t + \cdots + y_n t^n)^2 \geq 0.
\]

The following partial converse is less obvious: If $x_0 = 1$ and $H(x) \succ 0$, then there
exists a probability distribution on $\mathbf{R}$ such that $x_i = \mathbf{E}\, t^i$, $i = 0, \ldots, 2n$. (For a
proof, see exercise 2.37.) Now suppose that $x_0 = 1$, and $H(x) \succeq 0$ (but possibly
$H(x) \not\succ 0$), i.e., the linear matrix inequality (4.52) holds, but possibly not strictly.
In this case, there is a sequence of distributions on $\mathbf{R}$, whose moments converge to
$x$. In summary: the condition that $x_0, \ldots, x_{2n}$ be the moments of some distribution
on $\mathbf{R}$ (or the limit of the moments of a sequence of distributions) can be expressed
as the linear matrix inequality (4.52) in the variable $x$, together with the linear
equality $x_0 = 1$. Using this fact, we can cast some interesting problems involving
moments as SDPs.
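
The identity $y^T H y = \mathbf{E}(y_0 + y_1 t + \cdots + y_n t^n)^2$ can be checked numerically for a discrete distribution. The following sketch builds the moments and the Hankel matrix and evaluates both sides; the function names and the example distribution are illustrative assumptions.

```python
def moments(points, probs, order):
    """x_k = E t^k for k = 0..order, for a discrete distribution on the reals."""
    return [sum(p * t ** k for t, p in zip(points, probs)) for k in range(order + 1)]


def hankel(x):
    """(n+1) x (n+1) Hankel matrix with H[i][j] = x[i + j], built from x_0..x_{2n}."""
    n = (len(x) - 1) // 2
    return [[x[i + j] for j in range(n + 1)] for i in range(n + 1)]


def quadratic_form(H, y):
    """y^T H y."""
    m = len(y)
    return sum(y[i] * H[i][j] * y[j] for i in range(m) for j in range(m))


def expected_square(points, probs, y):
    """E (y_0 + y_1 t + ... + y_n t^n)^2 computed directly from the distribution."""
    return sum(p * sum(yk * t ** k for k, yk in enumerate(y)) ** 2
               for t, p in zip(points, probs))


if __name__ == "__main__":
    pts, pr = [-1.0, 0.0, 2.0], [0.25, 0.5, 0.25]
    H = hankel(moments(pts, pr, 4))          # n = 2, so H is 3 x 3
    y = [1.0, -2.0, 0.5]
    print(quadratic_form(H, y), expected_square(pts, pr, y))
```

Both printed numbers agree up to rounding, and both are nonnegative, which is the content of (4.52) for this particular $y$.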

Suppose $t$ is a random variable on $\mathbf{R}$. We do not know its distribution, but we
do know some bounds on the moments, i.e.,

\[
\underline{\mu}_k \leq \mathbf{E}\, t^k \leq \overline{\mu}_k, \quad k = 1, \ldots, 2n
\]

(which includes, as a special case, knowing exact values of some of the moments).
Let $p(t) = c_0 + c_1 t + \cdots + c_{2n} t^{2n}$ be a given polynomial in $t$. The expected value
of $p(t)$ is linear in the moments $\mathbf{E}\, t^i$:

\[
\mathbf{E}\, p(t) = \sum_{i=0}^{2n} c_i \mathbf{E}\, t^i = \sum_{i=0}^{2n} c_i x_i.
\]

We can compute upper and lower bounds for $\mathbf{E}\, p(t)$,

\[
\begin{array}{ll}
\mbox{minimize (maximize)} & \mathbf{E}\, p(t) \\
\mbox{subject to} & \underline{\mu}_k \leq \mathbf{E}\, t^k \leq \overline{\mu}_k, \quad k = 1, \ldots, 2n,
\end{array}
\]

over all probability distributions that satisfy the given moment bounds, by solving
the SDP

\[
\begin{array}{ll}
\mbox{minimize (maximize)} & c_1 x_1 + \cdots + c_{2n} x_{2n} \\
\mbox{subject to} & \underline{\mu}_k \leq x_k \leq \overline{\mu}_k, \quad k = 1, \ldots, 2n \\
& H(1, x_1, \ldots, x_{2n}) \succeq 0
\end{array}
\]

with variables $x_1, \ldots, x_{2n}$. Adding the constant $c_0$ (the term $c_0 x_0$ with $x_0 = 1$) to
the optimal values gives bounds on $\mathbf{E}\, p(t)$, over all probability distributions
that satisfy the known moment constraints. The bounds are sharp in
the sense that there exists a sequence of distributions, whose moments satisfy the
given moment bounds, for which $\mathbf{E}\, p(t)$ converges to the upper and lower bounds
found by these SDPs.

Bounding  portfolio  risk  with  incomplete  covariance  information 

We consider once again the setup for the classical Markowitz portfolio problem (see
page 155). We have a portfolio of $n$ assets or stocks, with $x_i$ denoting the amount
of asset $i$ that is held over some investment period, and $p_i$ denoting the relative
price change of asset $i$ over the period. The change in total value of the portfolio
is $p^T x$. The price change vector $p$ is modeled as a random vector, with mean and
covariance

\[
\bar{p} = \mathbf{E}\, p, \qquad \Sigma = \mathbf{E} (p - \bar{p})(p - \bar{p})^T.
\]

The change in value of the portfolio is therefore a random variable with mean $\bar{p}^T x$
and standard deviation $\sigma = (x^T \Sigma x)^{1/2}$. The risk of a large loss, i.e., a change
in portfolio value that is substantially below its expected value, is directly related
to the standard deviation $\sigma$, and increases with it. For this reason the standard
deviation $\sigma$ (or the variance $\sigma^2$) is used as a measure of the risk associated with
the portfolio.

In the classical portfolio optimization problem, the portfolio $x$ is the optimization
variable, and we minimize the risk subject to a minimum mean return and
other constraints. The price change statistics $\bar{p}$ and $\Sigma$ are known problem
parameters. In the risk bounding problem considered here, we turn the problem around:
we assume the portfolio $x$ is known, but only partial information is available about
the covariance matrix $\Sigma$. We might have, for example, an upper and lower bound
on each entry:

\[
L_{ij} \leq \Sigma_{ij} \leq U_{ij}, \quad i, j = 1, \ldots, n,
\]

where $L$ and $U$ are given. We now pose the question: what is the maximum risk
for our portfolio, over all covariance matrices consistent with the given bounds?
We define the worst-case variance of the portfolio as

\[
\sigma_{\rm wc}^2 = \sup \{ x^T \Sigma x \mid L_{ij} \leq \Sigma_{ij} \leq U_{ij}, \; i, j = 1, \ldots, n, \; \Sigma \succeq 0 \}.
\]

We have added the condition $\Sigma \succeq 0$, which the covariance matrix must, of course,
satisfy.

We can find $\sigma_{\rm wc}$ by solving the SDP

\[
\begin{array}{ll}
\mbox{maximize} & x^T \Sigma x \\
\mbox{subject to} & L_{ij} \leq \Sigma_{ij} \leq U_{ij}, \quad i, j = 1, \ldots, n \\
& \Sigma \succeq 0
\end{array}
\]

with variable $\Sigma \in \mathbf{S}^n$ (and problem parameters $x$, $L$, and $U$). The optimal $\Sigma$ is
the worst covariance matrix consistent with our given bounds on the entries, where
'worst' means largest risk with the (given) portfolio $x$. We can easily construct
a distribution for $p$ that is consistent with the given bounds, and achieves the
worst-case variance, from an optimal $\Sigma$ for the SDP. For example, we can take
$p = \bar{p} + \Sigma^{1/2} v$, where $v$ is any random vector with $\mathbf{E}\, v = 0$ and $\mathbf{E}\, vv^T = I$.

Evidently we can use the same method to determine $\sigma_{\rm wc}$ for any prior information
about $\Sigma$ that is convex. We list here some examples.

•  Known variance of certain portfolios. We might have equality constraints
such as

\[
u_k^T \Sigma u_k = \sigma_k^2,
\]

where $u_k$ and $\sigma_k$ are given. This corresponds to prior knowledge that certain
known portfolios (given by $u_k$) have known (or very accurately estimated)
variance.

•  Including effects of estimation error. If the covariance $\Sigma$ is estimated from
empirical data, the estimation method will give an estimate $\hat{\Sigma}$, and some
information about the reliability of the estimate, such as a confidence ellipsoid.
This can be expressed as

\[
C(\Sigma - \hat{\Sigma}) \leq \alpha,
\]

where $C$ is a positive definite quadratic form on $\mathbf{S}^n$, and the constant $\alpha$
determines the confidence level.

•  Factor models. The covariance might have the form

\[
\Sigma = F \Sigma_{\rm factor} F^T + D,
\]

where $F \in \mathbf{R}^{n \times k}$, $\Sigma_{\rm factor} \in \mathbf{S}^k$, and $D$ is diagonal. This corresponds to a
model of the price changes of the form

\[
p = Fz + d,
\]

where $z$ is a random variable (the underlying factors that affect the price
changes) and $d_i$ are independent (additional volatility of each asset price).
We assume that the factors are known. Since $\Sigma$ is linearly related to $\Sigma_{\rm factor}$
and $D$, we can impose any convex constraint on them (representing prior
information) and still compute $\sigma_{\rm wc}$ using convex optimization.

•  Information about correlation coefficients. In the simplest case, the diagonal
entries of $\Sigma$ (i.e., the volatilities of each asset price) are known, and bounds
on correlation coefficients between price changes are known:

\[
l_{ij} \leq \rho_{ij} = \frac{\Sigma_{ij}}{\Sigma_{ii}^{1/2} \Sigma_{jj}^{1/2}} \leq u_{ij}, \quad i, j = 1, \ldots, n.
\]

Since $\Sigma_{ii}$ are known, but $\Sigma_{ij}$ for $i \neq j$ are not, these are linear inequalities.
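
For two assets with known volatilities and a bounded correlation coefficient, the worst-case variance can be found by hand, which gives a useful sanity check for an SDP solver. In this special case $x^T \Sigma x$ is linear in the single unknown $\rho$, and $\Sigma \succeq 0$ reduces to $|\rho| \leq 1$, so the maximum sits at one end of the interval. The sketch below is an illustration under those assumptions, not the general method; the function name is hypothetical.

```python
def worst_case_variance_two_assets(x, sigma, rho_lo, rho_hi):
    """Worst-case x^T Sigma x for n = 2 with known volatilities sigma = (s1, s2)
    and correlation bounded by rho_lo <= rho <= rho_hi.

    Requires -1 <= rho_lo <= rho_hi <= 1, so every allowed Sigma is
    positive semidefinite. Returns (worst-case variance, maximizing rho).
    """
    if not -1.0 <= rho_lo <= rho_hi <= 1.0:
        raise ValueError("need -1 <= rho_lo <= rho_hi <= 1")
    x1, x2 = x
    s1, s2 = sigma
    cross = 2.0 * x1 * x2 * s1 * s2          # coefficient of rho in x^T Sigma x
    rho = rho_hi if cross >= 0 else rho_lo   # a linear function peaks at an endpoint
    variance = (x1 * s1) ** 2 + (x2 * s2) ** 2 + cross * rho
    return variance, rho


if __name__ == "__main__":
    # Long both assets: positive correlation is the worst case.
    print(worst_case_variance_two_assets((1.0, 1.0), (1.0, 1.0), -0.5, 0.5))
    # Long one, short the other: negative correlation is the worst case.
    print(worst_case_variance_two_assets((1.0, -1.0), (1.0, 1.0), -0.5, 0.5))
```

For $n > 2$ the positive semidefinite constraint couples the off-diagonal entries, and no such endpoint rule applies in general; that is where the SDP formulation is needed.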

Fastest  mixing  Markov  chain  on  a  graph 

We consider an undirected graph, with nodes $1, \ldots, n$, and a set of edges

\[
\mathcal{E} \subseteq \{1, \ldots, n\} \times \{1, \ldots, n\}.
\]

Here $(i, j) \in \mathcal{E}$ means that nodes $i$ and $j$ are connected by an edge. Since the
graph is undirected, $\mathcal{E}$ is symmetric: $(i, j) \in \mathcal{E}$ if and only if $(j, i) \in \mathcal{E}$. We allow
the possibility of self-loops, i.e., we can have $(i, i) \in \mathcal{E}$.

We define a Markov chain, with state $X(t) \in \{1, \ldots, n\}$, for $t \in \mathbf{Z}_+$ (the set
of nonnegative integers), as follows. With each edge $(i, j) \in \mathcal{E}$ we associate a
probability $P_{ij}$, which is the probability that $X$ makes a transition between nodes
$i$ and $j$. State transitions can only occur across edges; we have $P_{ij} = 0$ for $(i, j) \notin \mathcal{E}$.
The probabilities associated with the edges must be nonnegative, and for each node,
the sum of the probabilities of links connected to the node (including a self-loop,
if there is one) must equal one.

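The Markov chain has transition probability matrix

\[
P_{ij} = \mathbf{prob}(X(t + 1) = i \mid X(t) = j), \quad i, j = 1, \ldots, n.
\]

This matrix must satisfy

\[
P_{ij} \geq 0, \quad i, j = 1, \ldots, n, \qquad \mathbf{1}^T P = \mathbf{1}^T, \qquad P = P^T, \qquad (4.53)
\]

and also

\[
P_{ij} = 0 \mbox{ for } (i, j) \notin \mathcal{E}. \qquad (4.54)
\]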

Since $P$ is symmetric and $\mathbf{1}^T P = \mathbf{1}^T$, we conclude $P\mathbf{1} = \mathbf{1}$, so the uniform
distribution $(1/n)\mathbf{1}$ is an equilibrium distribution for the Markov chain.
Convergence of the distribution of $X(t)$ to $(1/n)\mathbf{1}$ is determined by the second largest (in
magnitude) eigenvalue of $P$, i.e., by $r = \max\{\lambda_2, -\lambda_n\}$, where

\[
1 = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n
\]

are the eigenvalues of $P$. We refer to $r$ as the mixing rate of the Markov chain.
If $r = 1$, then the distribution of $X(t)$ need not converge to $(1/n)\mathbf{1}$ (which means
the Markov chain does not mix). When $r < 1$, the distribution of $X(t)$ approaches
$(1/n)\mathbf{1}$ asymptotically as $r^t$, as $t \to \infty$. Thus, the smaller $r$ is, the faster the
Markov chain mixes.

The fastest mixing Markov chain problem is to find $P$, subject to the
constraints (4.53) and (4.54), that minimizes $r$. (The problem data is the graph, i.e.,
$\mathcal{E}$.) We will show that this problem can be formulated as an SDP.

Since the eigenvalue $\lambda_1 = 1$ is associated with the eigenvector $\mathbf{1}$, we can express
the mixing rate as the norm of the matrix $P$, restricted to the subspace $\mathbf{1}^\perp$: $r =
\|QPQ\|_2$, where $Q = I - (1/n)\mathbf{1}\mathbf{1}^T$ is the matrix representing orthogonal projection
on $\mathbf{1}^\perp$. Using the property $P\mathbf{1} = \mathbf{1}$, we have

\[
\begin{array}{rcl}
r & = & \|QPQ\|_2 \\
& = & \|(I - (1/n)\mathbf{1}\mathbf{1}^T) P (I - (1/n)\mathbf{1}\mathbf{1}^T)\|_2 \\
& = & \|P - (1/n)\mathbf{1}\mathbf{1}^T\|_2.
\end{array}
\]

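This shows that the mixing rate $r$ is a convex function of $P$, so the fastest mixing
Markov chain problem can be cast as the convex optimization problem

\[
\begin{array}{ll}
\mbox{minimize} & \|P - (1/n)\mathbf{1}\mathbf{1}^T\|_2 \\
\mbox{subject to} & P\mathbf{1} = \mathbf{1} \\
& P_{ij} \geq 0, \quad i, j = 1, \ldots, n \\
& P_{ij} = 0 \mbox{ for } (i, j) \notin \mathcal{E},
\end{array}
\]

with variable $P \in \mathbf{S}^n$. We can express the problem as an SDP by introducing a
scalar variable $t$ to bound the norm of $P - (1/n)\mathbf{1}\mathbf{1}^T$:

\[
\begin{array}{ll}
\mbox{minimize} & t \\
\mbox{subject to} & -tI \preceq P - (1/n)\mathbf{1}\mathbf{1}^T \preceq tI \\
& P\mathbf{1} = \mathbf{1} \\
& P_{ij} \geq 0, \quad i, j = 1, \ldots, n \\
& P_{ij} = 0 \mbox{ for } (i, j) \notin \mathcal{E}.
\end{array}
\qquad (4.55)
\]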
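
The objective $\|P - (1/n)\mathbf{1}\mathbf{1}^T\|_2$ can be evaluated for any candidate $P$ without an eigenvalue routine. The sketch below uses power iteration on the symmetric matrix $M = P - (1/n)\mathbf{1}\mathbf{1}^T$; it assumes the (arbitrary) starting vector is not orthogonal to the dominant eigenvectors of $M$, otherwise it returns an underestimate. It evaluates the mixing rate only and does not solve the SDP (4.55).

```python
import math


def mixing_rate(P, iters=1000):
    """Estimate r = ||P - (1/n) 1 1^T||_2 for a symmetric stochastic matrix P.

    For symmetric M the ratio ||M v|| / ||v|| under repeated multiplication
    tends to the largest eigenvalue magnitude of M.
    """
    n = len(P)
    M = [[P[i][j] - 1.0 / n for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]      # arbitrary nonzero starting vector
    r = 0.0
    for _ in range(iters):
        norm_v = math.sqrt(sum(vi * vi for vi in v))
        if norm_v == 0.0:                     # M maps the iterate to zero
            return r
        v = [vi / norm_v for vi in v]
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        r = math.sqrt(sum(vi * vi for vi in v))   # ||M v|| for a unit vector v
    return r


if __name__ == "__main__":
    # Two-node chain that switches state with probability 0.3:
    # the eigenvalues of P are 1 and 0.4.
    print(mixing_rate([[0.7, 0.3], [0.3, 0.7]]))
```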


4.7  Vector  optimization 

4.7.1  General  and  convex  vector  optimization  problems 

In §4.6 we extended the standard form problem (4.1) to include vector-valued
constraint functions. In this section we investigate the meaning of a vector-valued
objective function. We denote a general vector optimization problem as

\[
\begin{array}{ll}
\mbox{minimize (with respect to } K) & f_0(x) \\
\mbox{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p.
\end{array}
\qquad (4.56)
\]

Here $x \in \mathbf{R}^n$ is the optimization variable, $K \subseteq \mathbf{R}^q$ is a proper cone, $f_0 : \mathbf{R}^n \to \mathbf{R}^q$
is the objective function, $f_i : \mathbf{R}^n \to \mathbf{R}$ are the inequality constraint functions, and
$h_i : \mathbf{R}^n \to \mathbf{R}$ are the equality constraint functions. The only difference between this
problem and the standard optimization problem (4.1) is that here, the objective
function takes values in $\mathbf{R}^q$, and the problem specification includes a proper cone
$K$, which is used to compare objective values. In the context of vector optimization,
the standard optimization problem (4.1) is sometimes called a scalar optimization
problem.

We say the vector optimization problem (4.56) is a convex vector optimization
problem if the objective function $f_0$ is $K$-convex, the inequality constraint functions
$f_1, \ldots, f_m$ are convex, and the equality constraint functions $h_1, \ldots, h_p$ are affine.
(As in the scalar case, we usually express the equality constraints as $Ax = b$, where
$A \in \mathbf{R}^{p \times n}$.)

What meaning can we give to the vector optimization problem (4.56)? Suppose
$x$ and $y$ are two feasible points (i.e., they satisfy the constraints). Their associated
objective values, $f_0(x)$ and $f_0(y)$, are to be compared using the generalized
inequality $\preceq_K$. We interpret $f_0(x) \preceq_K f_0(y)$ as meaning that $x$ is 'better than or equal' in
value to $y$ (as judged by the objective $f_0$, with respect to $K$). The confusing aspect
of vector optimization is that the two objective values $f_0(x)$ and $f_0(y)$ need not be
comparable; we can have neither $f_0(x) \preceq_K f_0(y)$ nor $f_0(y) \preceq_K f_0(x)$, i.e., neither
is better than the other. This cannot happen in a scalar objective optimization
problem.
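
A tiny example makes the incomparability concrete. Taking $K = \mathbf{R}^q_+$, so that $\preceq_K$ is componentwise inequality, the sketch below checks whether two objective values are comparable; the function names are illustrative.

```python
def leq_K(a, b):
    """a <=_K b for K = R^q_+, i.e. componentwise a_i <= b_i."""
    return all(ai <= bi for ai, bi in zip(a, b))


def comparable(a, b):
    """True if a <=_K b or b <=_K a."""
    return leq_K(a, b) or leq_K(b, a)


if __name__ == "__main__":
    print(comparable((1, 2), (2, 3)))   # (1, 2) is better in both components
    print(comparable((1, 2), (2, 1)))   # each is better in one component only
```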


4.7.2  Optimal  points  and  values 

We first consider a special case, in which the meaning of the vector optimization
problem is clear. Consider the set of objective values of feasible points,

\[
\mathcal{O} = \{ f_0(x) \mid \exists x \in \mathcal{D}, \; f_i(x) \leq 0, \; i = 1, \ldots, m, \; h_i(x) = 0, \; i = 1, \ldots, p \} \subseteq \mathbf{R}^q,
\]

which is called the set of achievable objective values. If this set has a minimum
element (see §2.4.2), i.e., there is a feasible $x$ such that $f_0(x) \preceq_K f_0(y)$ for all
feasible $y$, then we say $x$ is optimal for the problem (4.56), and refer to $f_0(x)$ as
the optimal value of the problem. (When a vector optimization problem has an
optimal value, it is unique.) If $x^\star$ is an optimal point, then $f_0(x^\star)$, the objective
at $x^\star$, can be compared to the objective at every other feasible point, and is better
than or equal to it. Roughly speaking, $x^\star$ is unambiguously a best choice for $x$,
among feasible points.

A point $x^\star$ is optimal if and only if it is feasible and

\[
\mathcal{O} \subseteq f_0(x^\star) + K \qquad (4.57)
\]

(see §2.4.2). The set $f_0(x^\star) + K$ can be interpreted as the set of values that are
worse than, or equal to, $f_0(x^\star)$, so the condition (4.57) states that every achievable
value falls in this set. This is illustrated in figure 4.7. Most vector optimization
problems do not have an optimal point and an optimal value, but this does occur
in some special cases.

Figure 4.7 The set $\mathcal{O}$ of achievable values for a vector optimization with
objective values in $\mathbf{R}^2$, with cone $K = \mathbf{R}^2_+$, is shown shaded. In this case,
the point labeled $f_0(x^\star)$ is the optimal value of the problem, and $x^\star$ is an
optimal point. The objective value $f_0(x^\star)$ can be compared to every other
achievable value $f_0(y)$, and is better than or equal to $f_0(y)$. (Here, 'better
than or equal to' means 'is below and to the left of'.) The lightly shaded
region is $f_0(x^\star) + K$, which is the set of all $z \in \mathbf{R}^2$ corresponding to objective
values worse than (or equal to) $f_0(x^\star)$.


Example 4.9 Best linear unbiased estimator. Suppose $y = Ax + v$, where $v \in \mathbf{R}^m$ is
a measurement noise, $y \in \mathbf{R}^m$ is a vector of measurements, and $x \in \mathbf{R}^n$ is a vector to
be estimated, given the measurement $y$. We assume that $A$ has rank $n$, and that the
measurement noise satisfies $\mathbf{E}\, v = 0$, $\mathbf{E}\, vv^T = I$, i.e., its components are zero mean
and uncorrelated.

A linear estimator of $x$ has the form $\hat{x} = Fy$. The estimator is called unbiased if for
all $x$ we have $\mathbf{E}\, \hat{x} = x$, i.e., if $FA = I$. The error covariance of an unbiased estimator
is

\[
\mathbf{E} (\hat{x} - x)(\hat{x} - x)^T = \mathbf{E}\, F v v^T F^T = F F^T.
\]

Our goal is to find an unbiased estimator that has a 'small' error covariance matrix.
We can compare error covariances using matrix inequality, i.e., with respect to $\mathbf{S}^n_+$.
This has the following interpretation: Suppose $\hat{x}_1 = F_1 y$, $\hat{x}_2 = F_2 y$ are two unbiased
estimators. Then the first estimator is at least as good as the second, i.e., $F_1 F_1^T \preceq
F_2 F_2^T$, if and only if for all $c$,

\[
\mathbf{E} (c^T \hat{x}_1 - c^T x)^2 \leq \mathbf{E} (c^T \hat{x}_2 - c^T x)^2.
\]

In other words, for any linear function of $x$, the estimator $F_1$ yields at least as good
an estimate as does $F_2$.

We can express the problem of finding an unbiased estimator for $x$ as the vector
optimization problem

\[
\begin{array}{ll}
\mbox{minimize (w.r.t. } \mathbf{S}^n_+) & F F^T \\
\mbox{subject to} & FA = I,
\end{array}
\qquad (4.58)
\]

with variable $F \in \mathbf{R}^{n \times m}$. The objective $F F^T$ is convex with respect to $\mathbf{S}^n_+$, so the
problem (4.58) is a convex vector optimization problem. An easy way to see this is
to observe that $v^T F F^T v = \|F^T v\|_2^2$ is a convex function of $F$ for any fixed $v$.

It is a famous result that the problem (4.58) has an optimal solution, the least-squares
estimator, or pseudo-inverse,

\[
F^\star = A^\dagger = (A^T A)^{-1} A^T.
\]

For any $F$ with $FA = I$, we have $F F^T \succeq F^\star F^{\star T}$. The matrix

\[
F^\star F^{\star T} = A^\dagger A^{\dagger T} = (A^T A)^{-1}
\]

is the optimal value of the problem (4.58).
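
The inequality $F F^T \succeq F^\star F^{\star T}$ can be verified in a few lines. The following derivation is a sketch using only the constraint $FA = I$ and the definition $F^\star = (A^T A)^{-1} A^T$.

```latex
% Both F and F^\star are feasible, so (F - F^\star) A = I - I = 0.
% Since F^{\star T} = A (A^T A)^{-1}, the cross term vanishes:
\[
(F - F^\star) F^{\star T} = (F - F^\star) A (A^T A)^{-1} = 0 .
\]
% Expanding F = F^\star + (F - F^\star) then gives
\[
F F^T = F^\star F^{\star T} + (F - F^\star)(F - F^\star)^T \succeq F^\star F^{\star T},
\]
% because (F - F^\star)(F - F^\star)^T is positive semidefinite.
```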


4.7.3  Pareto  optimal  points  and  values 

We now consider the case (which occurs in most vector optimization problems of
interest) in which the set of achievable objective values does not have a minimum
element, so the problem does not have an optimal point or optimal value. In these
cases minimal elements of the set of achievable values play an important role. We
say that a feasible point $x$ is Pareto optimal (or efficient) if $f_0(x)$ is a minimal
element of the set of achievable values $\mathcal{O}$. In this case we say that $f_0(x)$ is a
Pareto optimal value for the vector optimization problem (4.56). Thus, a point $x$
is Pareto optimal if it is feasible and, for any feasible $y$, $f_0(y) \preceq_K f_0(x)$ implies
$f_0(y) = f_0(x)$. In other words: any feasible point $y$ that is better than or equal to
$x$ (i.e., $f_0(y) \preceq_K f_0(x)$) has exactly the same objective value as $x$.

A point $x$ is Pareto optimal if and only if it is feasible and

\[
(f_0(x) - K) \cap \mathcal{O} = \{ f_0(x) \} \qquad (4.59)
\]

(see §2.4.2). The set $f_0(x) - K$ can be interpreted as the set of values that are
better than or equal to $f_0(x)$, so the condition (4.59) states that the only achievable
value better than or equal to $f_0(x)$ is $f_0(x)$ itself. This is illustrated in figure 4.8.

A vector optimization problem can have many Pareto optimal values (and
points). The set of Pareto optimal values, denoted $\mathcal{P}$, satisfies

\[
\mathcal{P} \subseteq \mathcal{O} \cap \mathbf{bd}\, \mathcal{O},
\]

i.e., every Pareto optimal value is an achievable objective value that lies in the
boundary of the set of achievable objective values (see exercise 4.52).
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Figure  4.8  The  set  O  of  achievable  values  for  a  vector  optimization  problem 
with  objective  values  in  R2,  with  cone  K  =  R+,  is  shown  shaded.  This 
problem  does  not  have  an  optimal  point  or  value,  but  it  does  have  a  set  of 
Pareto  optimal  points,  whose  corresponding  values  are  shown  as  the  dark¬ 
ened  curve  on  the  lower  left  boundary  of  O.  The  point  labeled  fo(xp°) 
is  a  Pareto  optimal  value,  and  xp°  is  a  Pareto  optimal  point.  The  lightly 
shaded  region  is  fo{xp°)  —  K,  which  is  the  set  of  all  z  £  R2  corresponding 
to  objective  values  better  than  (or  equal  to)  fo(xpo). 


4.7.4  Scalarization 

Scalarization  is  a  standard  technique  for  finding  Pareto  optimal  (or  optimal)  points 
for  a  vector  optimization  problem,  based  on  the  characterization  of  minimum  and 
minimal points via dual generalized inequalities given in §2.6.3. Choose any
\lambda \succ_{K^*} 0, i.e., any vector that is positive in the dual generalized inequality. Now consider
the scalar optimization problem

    minimize    \lambda^T f_0(x)
    subject to  f_i(x) \le 0,  i = 1, ..., m        (4.60)
                h_i(x) = 0,  i = 1, ..., p,

and let x be an optimal point. Then x is Pareto optimal for the vector optimization
problem (4.56). This follows from the dual inequality characterization of minimal
points given in §2.6.3, and is also easily shown directly. If x were not Pareto optimal,
then there would be a feasible y that satisfies f_0(y) \preceq_K f_0(x) and f_0(x) \ne f_0(y).
Since f_0(x) - f_0(y) \succeq_K 0 and is nonzero, we have \lambda^T (f_0(x) - f_0(y)) > 0, i.e.,
\lambda^T f_0(x) > \lambda^T f_0(y). This contradicts the assumption that x is optimal for the
scalar problem (4.60).

Using scalarization, we can find Pareto optimal points for any vector optimization
problem by solving the ordinary scalar optimization problem (4.60). The
vector \lambda, which is sometimes called the weight vector, must satisfy \lambda \succ_{K^*} 0. The
weight vector is a free parameter; by varying it we obtain (possibly) different Pareto
optimal solutions of the vector optimization problem (4.56). This is illustrated in
figure 4.9. The figure also shows an example of a Pareto optimal point that cannot
be obtained via scalarization, for any value of the weight vector \lambda \succ_{K^*} 0.

Figure 4.9 Scalarization. The set O of achievable values for a vector optimization
problem with cone K = R^2_+. Three Pareto optimal values f_0(x_1),
f_0(x_2), f_0(x_3) are shown. The first two values can be obtained by scalarization:
f_0(x_1) minimizes \lambda_1^T u over all u \in O and f_0(x_2) minimizes \lambda_2^T u,
where \lambda_1, \lambda_2 \succ 0. The value f_0(x_3) is Pareto optimal, but cannot be found
by scalarization.
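
The situation described for figure 4.9 can be reproduced on a tiny, hypothetical example. In the sketch below (Python with NumPy) the achievable set consists of just three values in R^2, all Pareto optimal with respect to R^2_+; the set, the weight grid, and the helper `scalarize` are assumptions made for illustration. Sweeping strictly positive weights recovers two of the values but never the third.

```python
import numpy as np

# Hypothetical achievable set O: three Pareto optimal values in R^2.
O = np.array([[0.0, 2.0], [2.0, 0.0], [1.2, 1.2]])

def scalarize(values, lam):
    """Index of a minimizer of lam^T u over the rows u of `values`."""
    return int(np.argmin(values @ lam))

# Sweep strictly positive weights lam = (t, 1 - t) with 0 < t < 1.
weights = [np.array([t, 1.0 - t]) for t in np.linspace(0.01, 0.99, 99)]
found = sorted({scalarize(O, lam) for lam in weights})
print(found)
```

The value (1.2, 1.2) is never selected because, for every weight (t, 1 - t), one of the other two values has weighted sum at most 1, while (1.2, 1.2) always scores 1.2. This is possible only because this set O (together with O + K) is not convex.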

The  method  of  scalarization  can  be  interpreted  geometrically.  A  point  x  is 
optimal for the scalarized problem, i.e., minimizes \lambda^T f_0 over the feasible set, if
and only if \lambda^T (f_0(y) - f_0(x)) \ge 0 for all feasible y. But this is the same as saying
that \{u \mid -\lambda^T (u - f_0(x)) = 0\} is a supporting hyperplane to the set of achievable
objective values O at the point f_0(x); in particular

    \{u \mid \lambda^T (u - f_0(x)) < 0\} \cap O = \emptyset.        (4.61)

(See figure 4.9.) Thus, when we find an optimal point for the scalarized problem, we
not only find a Pareto optimal point for the original vector optimization problem;
we also find an entire halfspace in R^q, given by (4.61), of objective values that
cannot be achieved.

Scalarization  of  convex  vector  optimization  problems 

Now  suppose  the  vector  optimization  problem  (4.56)  is  convex.  Then  the  scalarized 
problem (4.60) is also convex, since \lambda^T f_0 is a (scalar-valued) convex function (by
the results in §3.6). This means that we can find Pareto optimal points of a convex
vector optimization problem by solving a convex scalar optimization problem. For
each choice of the weight vector \lambda \succ_{K^*} 0 we get a (usually different) Pareto optimal
point. 

For convex vector optimization problems we have a partial converse: For every
Pareto optimal point x^{po}, there is some nonzero \lambda \succeq_{K^*} 0 such that x^{po} is a solution
of the scalarized problem (4.60). So, roughly speaking, for convex problems the
method of scalarization yields all Pareto optimal points, as the weight vector \lambda
varies over the K^*-nonnegative, nonzero values. We have to be careful here, because
it is not true that every solution of the scalarized problem, with \lambda \succeq_{K^*} 0 and \lambda \ne 0,
is a Pareto optimal point for the vector problem. (In contrast, every solution of
the scalarized problem with \lambda \succ_{K^*} 0 is Pareto optimal.)

In  some  cases  we  can  use  this  partial  converse  to  find  all  Pareto  optimal  points 
of a convex vector optimization problem. Scalarization with \lambda \succ_{K^*} 0 gives a set
of Pareto optimal points (as it would in a nonconvex vector optimization problem
as well). To find the remaining Pareto optimal solutions, we have to consider
nonzero weight vectors \lambda that satisfy \lambda \succeq_{K^*} 0. For each such weight vector, we
first  identify  all  solutions  of  the  scalarized  problem.  Then  among  these  solutions  we 
must  check  which  are,  in  fact,  Pareto  optimal  for  the  vector  optimization  problem. 
These  ‘extreme’  Pareto  optimal  points  can  also  be  found  as  the  limits  of  the  Pareto 
optimal  points  obtained  from  positive  weight  vectors. 

To  establish  this  partial  converse,  we  consider  the  set 

    A = O + K = \{t \in R^q \mid f_0(x) \preceq_K t for some feasible x\},        (4.62)

which consists of all values that are worse than or equal to (with respect to \preceq_K)
some achievable objective value. While the set O of achievable objective values
need not be convex, the set A is convex, when the problem is convex. Moreover,
the minimal elements of A are exactly the same as the minimal elements of the
set O of achievable values, i.e., they are the same as the Pareto optimal values.
(See exercise 4.53.) Now we use the results of §2.6.3 to conclude that any minimal
element of A minimizes \lambda^T z over A for some nonzero \lambda \succeq_{K^*} 0. This means that
every Pareto optimal point for the vector optimization problem is optimal for the
scalarized problem, for some nonzero weight \lambda \succeq_{K^*} 0.


Example 4.10 Minimal upper bound on a set of matrices. We consider the (convex)
vector optimization problem, with respect to the positive semidefinite cone,

    minimize (w.r.t. S^n_+)  X
    subject to               X \succeq A_i,  i = 1, ..., m,        (4.63)

where A_i \in S^n, i = 1, ..., m, are given. The constraints mean that X is an upper
bound on the given matrices A_1, ..., A_m; a Pareto optimal solution of (4.63) is a
minimal upper bound on the matrices.

To find a Pareto optimal point, we apply scalarization: we choose any W \in S^n_{++} and
form the problem

    minimize    tr(WX)
    subject to  X \succeq A_i,  i = 1, ..., m,        (4.64)

which is an SDP. Different choices for W will, in general, give different minimal
solutions.

The partial converse tells us that if X is Pareto optimal for the vector problem (4.63)
then it is optimal for the SDP (4.64), for some nonzero weight matrix W \succeq 0.
(In this case, however, not every solution of (4.64) is Pareto optimal for the vector
optimization problem.)

We can give a simple geometric interpretation for this problem. We associate with
each A \in S^n_{++} an ellipsoid centered at the origin, given by

    E_A = \{u \mid u^T A^{-1} u \le 1\},

so that A \preceq B if and only if E_A \subseteq E_B. A Pareto optimal point X for the problem
(4.63) corresponds to a minimal ellipsoid that contains the ellipsoids associated
with A_1, ..., A_m. An example is shown in figure 4.10.

Figure 4.10 Geometric interpretation of the problem (4.63). The three
shaded ellipsoids correspond to the data A_1, A_2, A_3 \in S^2_{++}; the Pareto
optimal points correspond to minimal ellipsoids that contain them. The two
ellipsoids, with boundaries labeled X_1 and X_2, show two minimal ellipsoids
obtained by solving the SDP (4.64) for two different weight matrices W_1 and
W_2.
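
The correspondence between the matrix inequality and ellipsoid inclusion can be illustrated numerically. The following Python/NumPy sketch compares an eigenvalue test for B - A with a sampled containment test; the matrices `A`, `B`, `C`, the helper names, and the sampling scheme are assumptions for illustration, and sampling can only refute containment, not prove it.

```python
import numpy as np

def psd_leq(A, B, tol=1e-10):
    """True if B - A is positive semidefinite, i.e., A <= B in the matrix order."""
    return bool(np.linalg.eigvalsh(B - A).min() >= -tol)

def ellipsoid_contained(A, B, n_samples=2000, seed=0):
    """Sampled check that {u | u^T A^{-1} u <= 1} lies in {u | u^T B^{-1} u <= 1}."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n_samples, A.shape[0]))
    V /= np.linalg.norm(V, axis=1, keepdims=True)        # unit vectors v
    U = V @ np.linalg.cholesky(A).T                       # rows u = L v lie on the boundary of E_A
    q = np.einsum("ij,ij->i", U @ np.linalg.inv(B), U)    # u^T B^{-1} u for each row
    return bool(np.all(q <= 1 + 1e-9))

A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = A + np.diag([1.0, 0.5])          # B - A is positive definite, so A <= B
C = np.diag([3.0, 0.5])              # C - A has a negative eigenvalue

print(psd_leq(A, B), ellipsoid_contained(A, B))
print(psd_leq(A, C), ellipsoid_contained(A, C))
```

In both cases the two tests agree: the ellipsoid of A fits inside that of B, but not inside that of C.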


4.7.5  Multicriterion  optimization 

When a vector optimization problem involves the cone K = R^q_+, it is called a
multicriterion or multi-objective optimization problem. The components of f_0,
say, F_1, ..., F_q, can be interpreted as q different scalar objectives, each of which
we would like to minimize. We refer to F_i as the ith objective of the problem. A
multicriterion optimization problem is convex if f_1, ..., f_m are convex, h_1, ..., h_p
are affine, and the objectives F_1, ..., F_q are convex.

Since multicriterion problems are vector optimization problems, all of the material
of §4.7.1-§4.7.4 applies. For multicriterion problems, though, we can be a
bit more specific in the interpretations. If x is feasible, we can think of F_i(x) as
its score or value, according to the ith objective. If x and y are both feasible,
F_i(x) \le F_i(y) means that x is at least as good as y, according to the ith objective;
F_i(x) < F_i(y) means that x is better than y, or x beats y, according to the ith objective.
If x and y are both feasible, we say that x is better than y, or x dominates
y, if F_i(x) \le F_i(y) for i = 1, ..., q, and for at least one j, F_j(x) < F_j(y). Roughly
speaking, x is better than y if x meets or beats y on all objectives, and beats it in
at least one objective.

In a multicriterion problem, an optimal point x^\star satisfies

    F_i(x^\star) \le F_i(y),  i = 1, ..., q,

for every feasible y. In other words, x^\star is simultaneously optimal for each of the
scalar problems

    minimize    F_j(x)
    subject to  f_i(x) \le 0,  i = 1, ..., m
                h_i(x) = 0,  i = 1, ..., p,

for j = 1, ..., q. When there is an optimal point, we say that the objectives are
noncompeting, since no compromises have to be made among the objectives; each
objective is as small as it could be made, even if the others were ignored.

A Pareto optimal point x^{po} satisfies the following: if y is feasible and F_i(y) \le
F_i(x^{po}) for i = 1, ..., q, then F_i(x^{po}) = F_i(y), i = 1, ..., q. This can be restated
as:  a  point  is  Pareto  optimal  if  and  only  if  it  is  feasible  and  there  is  no  better 
feasible  point.  In  particular,  if  a  feasible  point  is  not  Pareto  optimal,  there  is  at 
least  one  other  feasible  point  that  is  better.  In  searching  for  good  points,  then,  we 
can  clearly  limit  our  search  to  Pareto  optimal  points. 

Trade-off  analysis 

Now suppose that x and y are Pareto optimal points with, say,

    F_i(x) < F_i(y),  i \in A
    F_i(x) = F_i(y),  i \in B
    F_i(x) > F_i(y),  i \in C,

where A \cup B \cup C = \{1, ..., q\}. In other words, A is the set of (indices of) objectives
for  which  x  beats  y,  B  is  the  set  of  objectives  for  which  the  points  x  and  y  are  tied, 
and  C  is  the  set  of  objectives  for  which  y  beats  x.  If  A  and  C  are  empty,  then 
the  two  points  x  and  y  have  exactly  the  same  objective  values.  If  this  is  not  the 
case,  then  both  A  and  C  must  be  nonempty.  In  other  words,  when  comparing  two 
Pareto  optimal  points,  they  either  obtain  the  same  performance  (i.e.,  all  objectives 
equal),  or,  each  beats  the  other  in  at  least  one  objective. 
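
The index sets A, B, and C are easy to compute for two given objective vectors. The sketch below (Python with NumPy) uses a hypothetical helper `compare` and made-up objective values, with 1-based indices to match the text.

```python
import numpy as np

def compare(Fx, Fy):
    """Split objective indices (1-based) into A (x beats y), B (tied), C (y beats x)."""
    Fx, Fy = np.asarray(Fx, dtype=float), np.asarray(Fy, dtype=float)
    idx = range(len(Fx))
    A = [i + 1 for i in idx if Fx[i] < Fy[i]]
    B = [i + 1 for i in idx if Fx[i] == Fy[i]]
    C = [i + 1 for i in idx if Fx[i] > Fy[i]]
    return A, B, C

# Hypothetical objective values of two points x and y, with q = 3.
A_set, B_set, C_set = compare([1.0, 3.0, 2.0], [2.0, 3.0, 1.0])
print(A_set, B_set, C_set)
```

In this example x beats y on the first objective, the points tie on the second, and y beats x on the third, so both A and C are nonempty, as they must be for two Pareto optimal points with different objective values.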

In  comparing  the  point  x  to  y,  we  say  that  we  have  traded  or  traded  off  better 
objective values for i \in A for worse objective values for i \in C. Optimal trade-off
analysis  (or  just  trade-off  analysis)  is  the  study  of  how  much  worse  we  must  do 
in  one  or  more  objectives  in  order  to  do  better  in  some  other  objectives,  or  more 
generally,  the  study  of  what  sets  of  objective  values  are  achievable. 

As an example, consider a bi-criterion (i.e., two-criterion) problem. Suppose
x is a Pareto optimal point, with objectives F_1(x) and F_2(x). We might ask how
much larger F_2(z) would have to be, in order to obtain a feasible point z with
F_1(z) \le F_1(x) - a, where a > 0 is some constant. Roughly speaking, we are asking
how much we must pay in the second objective to obtain an improvement of a in
the first objective. If a large increase in F_2 must be accepted to realize a small
decrease in F_1, we say that there is a strong trade-off between the objectives, near
the Pareto optimal value (F_1(x), F_2(x)). If, on the other hand, a large decrease
in F_1 can be obtained with only a small increase in F_2, we say that the trade-off
between the objectives is weak (near the Pareto optimal value (F_1(x), F_2(x))).

We can also consider the case in which we trade worse performance in the first
objective for an improvement in the second. Here we find how much smaller F_2(z)
can be made, to obtain a feasible point z with F_1(z) \le F_1(x) + a, where a > 0
is some constant. In this case we receive a benefit in the second objective, i.e., a
reduction in F_2 compared to F_2(x). If this benefit is large (i.e., by increasing F_1
a small amount we obtain a large reduction in F_2), we say the objectives exhibit
a strong trade-off. If it is small, we say the objectives trade off weakly (near the
Pareto optimal value (F_1(x), F_2(x))).

Optimal  trade-off  surface 

The  set  of  Pareto  optimal  values  for  a  multicriterion  problem  is  called  the  optimal 
trade-off  surface  (in  general,  when  q  >  2)  or  the  optimal  trade-off  curve  (when 
q  =  2).  (Since  it  would  be  foolish  to  accept  any  point  that  is  not  Pareto  optimal, 
we  can  restrict  our  trade-off  analysis  to  Pareto  optimal  points.)  Trade-off  analysis 
is also sometimes called exploring the optimal trade-off surface. (The optimal trade-off
surface is usually, but not always, a surface in the usual sense. If the problem
has  an  optimal  point,  for  example,  the  optimal  trade-off  surface  consists  of  a  single 
point,  the  optimal  value.) 

An  optimal  trade-off  curve  is  readily  interpreted.  An  example  is  shown  in 
figure  4.11,  on  page  185,  for  a  (convex)  bi-criterion  problem.  From  this  curve  we 
can  easily  visualize  and  understand  the  trade-offs  between  the  two  objectives. 

• The endpoint at the right shows the smallest possible value of F_2, without
any consideration of F_1.

• The endpoint at the left shows the smallest possible value of F_1, without any
consideration of F_2.

• By finding the intersection of the curve with a vertical line at F_1 = \alpha, we can
see how large F_2 must be to achieve F_1 \le \alpha.

• By finding the intersection of the curve with a horizontal line at F_2 = \beta, we
can see how large F_1 must be to achieve F_2 \le \beta.

• The slope of the optimal trade-off curve at a point on the curve (i.e., a Pareto
optimal value) shows the local optimal trade-off between the two objectives.
Where the slope is steep, small changes in F_1 are accompanied by large
changes in F_2.

• A point of large curvature is one where small decreases in one objective can
only be accomplished by a large increase in the other. This is the proverbial
knee of the trade-off curve, and in many applications represents a good
compromise solution.

All  of  these  have  simple  extensions  to  a  trade-off  surface,  although  visualizing  a 
surface  with  more  than  three  objectives  is  difficult. 

Scalarizing  multicriterion  problems 

When we scalarize a multicriterion problem by forming the weighted sum objective

    \lambda^T f_0(x) = \sum_{i=1}^q \lambda_i F_i(x),

where \lambda \succ 0, we can interpret \lambda_i as the weight we attach to the ith objective.
The weight \lambda_i can be thought of as quantifying our desire to make F_i small (or
our objection to having F_i large). In particular, we should take \lambda_i large if we
want F_i to be small; if we care much less about F_i, we can take \lambda_i small. We can
interpret the ratio \lambda_i/\lambda_j as the relative weight or relative importance of the ith
objective compared to the jth objective. Alternatively, we can think of \lambda_i/\lambda_j as
the exchange rate between the two objectives, since in the weighted sum objective a
decrease (say) in F_i by \alpha is considered the same as an increase in F_j in the amount
(\lambda_i/\lambda_j)\alpha.
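
The exchange-rate interpretation can be verified in one line. Writing F_i and F_j for the current values of the two objectives and holding all other objectives fixed (an assumption made only for this check), the two terms of the weighted sum change as follows:

```latex
\lambda_i \bigl(F_i - \alpha\bigr)
  + \lambda_j \Bigl(F_j + \tfrac{\lambda_i}{\lambda_j}\,\alpha\Bigr)
  = \lambda_i F_i + \lambda_j F_j - \lambda_i \alpha + \lambda_i \alpha
  = \lambda_i F_i + \lambda_j F_j .
```

So the weighted sum is indifferent to exchanging a decrease of \alpha in the ith objective for an increase of (\lambda_i/\lambda_j)\alpha in the jth.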

These  interpretations  give  us  some  intuition  about  how  to  set  or  change  the 
weights  while  exploring  the  optimal  trade-off  surface.  Suppose,  for  example,  that 
the weight vector \lambda \succ 0 yields the Pareto optimal point x^{po}, with objective values
F_1(x^{po}), ..., F_q(x^{po}). To find a (possibly) new Pareto optimal point which trades
off a better kth objective value (say), for (possibly) worse objective values for the
other objectives, we form a new weight vector \tilde\lambda with

    \tilde\lambda_k > \lambda_k,   \tilde\lambda_j = \lambda_j,  j \ne k,  j = 1, ..., q,

i.e., we increase the weight on the kth objective. This yields a new Pareto optimal
point \tilde x^{po} with F_k(\tilde x^{po}) \le F_k(x^{po}) (and usually, F_k(\tilde x^{po}) < F_k(x^{po})), i.e., a new
Pareto optimal point with an improved kth objective.
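
The effect of increasing one weight can be seen on a toy problem. The sketch below (Python with NumPy) assumes three hypothetical objectives F_i(x) = \|x - c_i\|_2^2 with made-up centers c_i; for these, setting the gradient of the weighted sum to zero gives the minimizer as the \lambda-weighted average of the centers. None of this comes from the text; it is only meant to show the direction of the change.

```python
import numpy as np

centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])   # hypothetical c_1, c_2, c_3

def objectives(x):
    """Vector (F_1(x), F_2(x), F_3(x)) with F_i(x) = ||x - c_i||_2^2."""
    return np.sum((centers - x) ** 2, axis=1)

def scalarized_solution(lam):
    """Minimizer of sum_i lam_i ||x - c_i||_2^2: the weighted average of the centers."""
    lam = np.asarray(lam, dtype=float)
    return lam @ centers / lam.sum()

lam = np.array([1.0, 1.0, 1.0])
lam_new = lam.copy()
lam_new[1] = 5.0                  # increase the weight on the second objective

F_old = objectives(scalarized_solution(lam))
F_new = objectives(scalarized_solution(lam_new))
print(F_old, F_new)
```

With the larger weight, the second objective improves while the first and third get worse, which is the trade described above.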

We  can  also  see  that  at  any  point  where  the  optimal  trade-off  surface  is  smooth, 
A  gives  the  inward  normal  to  the  surface  at  the  associated  Pareto  optimal  point. 
In  particular,  when  we  choose  a  weight  vector  A  and  apply  scalarization,  we  obtain 
a  Pareto  optimal  point  where  A  gives  the  local  trade-offs  among  objectives. 

In  practice,  optimal  trade-off  surfaces  are  explored  by  ad  hoc  adjustment  of  the 
weights,  based  on  the  intuitive  ideas  above.  We  will  see  later  (in  chapter  5)  that 
the  basic  idea  of  scalarization,  i.e.,  minimizing  a  weighted  sum  of  objectives,  and 
then  adjusting  the  weights  to  obtain  a  suitable  solution,  is  the  essence  of  duality. 


4.7.6  Examples 

Regularized  least-squares 

We are given A \in R^{m \times n} and b \in R^m, and want to choose x \in R^n taking into
account two quadratic objectives:

• F_1(x) = \|Ax - b\|_2^2 = x^T A^T A x - 2 b^T A x + b^T b is a measure of the misfit
between Ax and b,

• F_2(x) = \|x\|_2^2 = x^T x is a measure of the size of x.

Our goal is to find x that gives a good fit (i.e., small F_1) and that is not large (i.e.,
small F_2). We can formulate this problem as a vector optimization problem with
respect to the cone R^2_+, i.e., a bi-criterion problem (with no constraints):

    minimize (w.r.t. R^2_+)  f_0(x) = (F_1(x), F_2(x)).

Figure 4.11 Optimal trade-off curve for a regularized least-squares problem.
The shaded set is the set of achievable values (\|Ax - b\|_2^2, \|x\|_2^2). The optimal
trade-off curve, shown darker, is the lower left part of the boundary. The
horizontal axis is F_1(x) = \|Ax - b\|_2^2.


We can scalarize this problem by taking \lambda_1 > 0 and \lambda_2 > 0 and minimizing the
scalar weighted sum objective

    \lambda^T f_0(x) = \lambda_1 F_1(x) + \lambda_2 F_2(x)
                     = x^T (\lambda_1 A^T A + \lambda_2 I) x - 2 \lambda_1 b^T A x + \lambda_1 b^T b,

which yields

    x(\mu) = (\lambda_1 A^T A + \lambda_2 I)^{-1} \lambda_1 A^T b = (A^T A + \mu I)^{-1} A^T b,

where \mu = \lambda_2/\lambda_1. For any \mu > 0, this point is Pareto optimal for the bi-criterion
problem. We can interpret \mu = \lambda_2/\lambda_1 as the relative weight we assign F_2 compared
to F_1.
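
The formula for x(\mu) makes it straightforward to trace the trade-off curve numerically. The following Python/NumPy sketch uses randomly generated data (the sizes, seed, grid of \mu values, and function names are assumptions, not the data behind figure 4.11). It also estimates the slope dF_2/dF_1 of the curve by a finite difference; since x(\mu) minimizes F_1 + \mu F_2, that slope should be close to -1/\mu = -\lambda_1/\lambda_2.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 10))    # hypothetical problem data
b = rng.standard_normal(100)

def x_of_mu(mu):
    """Minimizer of ||Ax - b||_2^2 + mu * ||x||_2^2."""
    return np.linalg.solve(A.T @ A + mu * np.eye(A.shape[1]), A.T @ b)

def objectives(x):
    """Pair (F_1(x), F_2(x)) = (||Ax - b||_2^2, ||x||_2^2)."""
    return np.sum((A @ x - b) ** 2), np.sum(x ** 2)

mus = np.logspace(-2, 3, 30)
curve = np.array([objectives(x_of_mu(mu)) for mu in mus])   # rows are (F_1, F_2)
F1, F2 = curve[:, 0], curve[:, 1]

# Slope of the trade-off curve at mu0, by a central difference in mu.
mu0, h = 10.0, 1e-3
f_plus, f_minus = objectives(x_of_mu(mu0 + h)), objectives(x_of_mu(mu0 - h))
slope = (f_plus[1] - f_minus[1]) / (f_plus[0] - f_minus[0])
print(slope, -1.0 / mu0)
```

As \mu increases along the grid, F_1 increases and F_2 decreases, so plotting F2 against F1 gives a curve of the kind shown in figure 4.11.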

This method produces all Pareto optimal points, except two, associated with
the extremes \mu \to \infty and \mu \to 0. In the first case we have the Pareto optimal
solution x = 0, which would be obtained by scalarization with \lambda = (0, 1). At the
other extreme we have the Pareto optimal solution A^\dagger b, where A^\dagger is the pseudo-inverse
of A. This Pareto optimal solution is obtained as the limit of the optimal
solution of the scalarized problem as \mu \to 0, i.e., as \lambda \to (1, 0). (We will encounter
the regularized least-squares problem again in §6.3.2.)
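
Both limiting cases can be observed numerically. This short Python/NumPy sketch, again with assumed random data and illustrative names, evaluates x(\mu) for a very small and a very large \mu and compares with A^\dagger b and with 0.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))      # hypothetical data with full column rank
b = rng.standard_normal(20)

def x_of_mu(mu):
    """Minimizer of ||Ax - b||_2^2 + mu * ||x||_2^2."""
    return np.linalg.solve(A.T @ A + mu * np.eye(5), A.T @ b)

x_ls = np.linalg.pinv(A) @ b                        # the solution A^dagger b
err_small_mu = np.linalg.norm(x_of_mu(1e-10) - x_ls)
norm_large_mu = np.linalg.norm(x_of_mu(1e10))
print(err_small_mu, norm_large_mu)
```

Both printed quantities are tiny, consistent with x(\mu) approaching A^\dagger b as \mu \to 0 and approaching 0 as \mu \to \infty.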

Figure  4.11  shows  the  optimal  trade-off  curve  and  the  set  of  achievable  values 
for a regularized least-squares problem with problem data A \in R^{100 \times 10}, b \in R^{100}.
(See  exercise  4.50  for  more  discussion.) 

Risk-return  trade-off  in  portfolio  optimization 

The classical Markowitz portfolio optimization problem described on page 155 is
naturally expressed as a bi-criterion problem, where the objectives are the negative
mean return (since we wish to maximize mean return) and the variance of the
return:

    minimize (w.r.t. R^2_+)  (F_1(x), F_2(x)) = (-\bar p^T x, x^T \Sigma x)
    subject to               1^T x = 1,  x \succeq 0.

In forming the associated scalarized problem, we can (without loss of generality)
take \lambda_1 = 1 and \lambda_2 = \mu > 0:

    minimize    -\bar p^T x + \mu x^T \Sigma x
    subject to  1^T x = 1,  x \succeq 0,

which is a QP. In this example too, we get all Pareto optimal portfolios except for
the two limiting cases corresponding to \mu \to 0 and \mu \to \infty. Roughly speaking, in
the first case we get a maximum mean return, without regard for return variance;
in the second case we form a minimum variance return, without regard for mean
return. Assuming that \bar p_k > \bar p_i for i \ne k, i.e., that asset k is the unique asset with
maximum mean return, the portfolio allocation x = e_k is the only one corresponding
to \mu \to 0. (In other words, we concentrate the portfolio entirely in the asset
that has maximum mean return.) In many portfolio problems asset n corresponds
to a risk-free investment, with (deterministic) return r^{rf}. Assuming that \Sigma, with its
last row and column (which are zero) removed, is full rank, then the other extreme
Pareto optimal portfolio is x = e_n, i.e., the portfolio is concentrated entirely in the
risk-free asset.

As a specific example, we consider a simple portfolio optimization problem with
4 assets, with price change mean and standard deviations given in the following
table.

    Asset   \bar p_i   \Sigma_{ii}^{1/2}
    1       12%        20%
    2       10%        10%
    3        7%         5%
    4        3%         0%

Asset  4  is  a  risk-free  asset,  with  a  (certain)  3%  return.  Assets  3,  2,  and  1  have 
increasing  mean  returns,  ranging  from  7%  to  12%,  as  well  as  increasing  standard 
deviations,  which  range  from  5%  to  20%.  The  correlation  coefficients  between  the 
assets are \rho_{12} = 30%, \rho_{13} = -40%, and \rho_{23} = 0%.
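
The scalarized portfolio problem for this data can be solved approximately with very little code. The sketch below (Python with NumPy) builds \Sigma from the table and the correlation coefficients, and then applies projected gradient descent with a sort-based projection onto the simplex. This method is a substitute chosen to keep the example self-contained; it is not the method used in the text, and the iteration count, the \mu values, and all function names are assumptions.

```python
import numpy as np

p_bar = np.array([0.12, 0.10, 0.07, 0.03])      # mean returns from the table
sigma = np.array([0.20, 0.10, 0.05, 0.00])      # standard deviations from the table
corr = np.eye(4)
corr[0, 1] = corr[1, 0] = 0.30                  # rho_12
corr[0, 2] = corr[2, 0] = -0.40                 # rho_13 (rho_23 = 0)
Sigma = np.outer(sigma, sigma) * corr           # covariance matrix

def project_simplex(v):
    """Euclidean projection of v onto {x | 1^T x = 1, x >= 0} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def solve_scalarized(mu, iters=5000):
    """Projected gradient for: minimize -p_bar^T x + mu x^T Sigma x over the simplex."""
    step = 1.0 / (2.0 * mu * np.linalg.eigvalsh(Sigma).max())   # 1 / Lipschitz constant
    x = np.full(4, 0.25)
    for _ in range(iters):
        grad = -p_bar + 2.0 * mu * Sigma @ x
        x = project_simplex(x - step * grad)
    return x

results = {mu: solve_scalarized(mu) for mu in (0.01, 1.0, 1000.0)}
stds = {mu: float(np.sqrt(x @ Sigma @ x)) for mu, x in results.items()}
for mu, x in results.items():
    print(mu, np.round(x, 3), round(float(p_bar @ x), 4), round(stds[mu], 4))
```

For very small \mu the allocation concentrates in asset 1, and for very large \mu it is almost entirely in the risk-free asset 4, in line with the two limiting cases discussed above.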

Figure  4.12  shows  the  optimal  trade-off  curve  for  this  portfolio  optimization 
problem. The plot is given in the conventional way, with the horizontal axis showing
standard deviation (i.e., square root of variance) and the vertical axis showing
expected  return.  The  lower  plot  shows  the  optimal  asset  allocation  vector  x  for 
each  Pareto  optimal  point. 

The  results  in  this  simple  example  agree  with  our  intuition.  For  small  risk, 
the  optimal  allocation  consists  mostly  of  the  risk-free  asset,  with  a  mixture  of  the 
other  assets  in  smaller  quantities.  Note  that  a  mixture  of  asset  3  and  asset  1,  which 
are  negatively  correlated,  gives  some  hedging,  i.e.,  lowers  variance  for  a  given  level 
of  mean  return.  At  the  other  end  of  the  trade-off  curve,  we  see  that  aggressive 
growth  portfolios  (i.e.,  those  with  large  mean  returns)  concentrate  the  allocation 
in assets 1 and 2, the ones with the largest mean returns (and variances).

Figure 4.12 Top. Optimal risk-return trade-off curve for a simple portfolio
optimization problem. The horizontal axis shows the standard deviation of
return, from 0% to 20%. The lefthand endpoint corresponds to putting all
resources in the risk-free asset, and so has zero standard deviation. The
righthand endpoint corresponds to putting all resources in asset 1, which
has highest mean return. Bottom. Corresponding optimal allocations.

Bibliography

Research on quadratic programming began in the 1950s (see, e.g., Frank and Wolfe
[FW56], Markowitz [Mar56], Hildreth [Hil57]), and was in part motivated by the portfolio
optimization problem discussed on page 155 (Markowitz [Mar52]), and the LP with
random cost discussed on page 154 (see Freund [Fre56]).

Interest  in  second-order  cone  programming  is  more  recent,  and  started  with  Nesterov 
and  Nemirovski  [NN94,  §6.2.3].  The  theory  and  applications  of  SOCPs  are  surveyed  by 
Alizadeh  and  Goldfarb  [AG03],  Ben-Tal  and  Nemirovski  [BTN01,  lecture  3]  (where  the 
problem  is  referred  to  as  conic  quadratic  programming ),  and  Lobo,  Vandenberghe,  Boyd, 
and Lebret [LVBL98].

Robust  linear  programming,  and  robust  convex  optimization  in  general,  originated  with 
Ben-Tal and Nemirovski [BTN98, BTN99] and El Ghaoui and Lebret [EL97]. Goldfarb
and Iyengar [GI03a, GI03b] discuss robust QCQPs and applications in portfolio optimization.
El Ghaoui, Oustry, and Lebret [EOL98] focus on robust semidefinite programming.

Geometric  programming  has  been  known  since  the  1960s.  Its  use  in  engineering  design 
was first advocated by Duffin, Peterson, and Zener [DPZ67] and Zener [Zen71]. Peterson
[Pet76] and Ecker [Eck80] describe the progress made during the 1970s. These articles
and books also include examples of engineering applications, in particular in chemical
and civil engineering. Fishburn and Dunlop [FD85], Sapatnekar, Rao, Vaidya, and Kang
[SRVK93], and Hershenson, Boyd, and Lee [HBL01] apply geometric programming to
problems  in  integrated  circuit  design.  The  cantilever  beam  design  example  (page  163) 
is  from  Vanderplaats  [Van84,  page  147].  The  variational  characterization  of  the  Perron- 
Frobenius  eigenvalue  (page  165)  is  proved  in  Berman  and  Plemmons  [BP94,  page  31]. 

Nesterov  and  Nemirovski  [NN94,  chapter  4]  introduced  the  conic  form  problem  (4.49) 
as  a  standard  problem  format  in  nonlinear  convex  optimization.  The  cone  programming 
approach  is  further  developed  in  Ben-Tal  and  Nemirovski  [BTN01],  who  also  describe 
numerous  applications. 

Alizadeh  [Ali91]  and  Nesterov  and  Nemirovski  [NN94,  §6.4]  were  the  first  to  make  a 
systematic  study  of  semidefinite  programming,  and  to  point  out  the  wide  variety  of 
applications  in  convex  optimization.  Subsequent  research  in  semidefinite  programming 
during  the  1990s  was  driven  by  applications  in  combinatorial  optimization  (Goemans 
and  Williamson  [GW95]),  control  (Boyd,  El  Ghaoui,  Feron,  and  Balakrishnan  [BEFB94], 
Scherer,  Gahinet,  and  Chilali  [SGC97],  Dullerud  and  Paganini  [DP00]),  communications 
and  signal  processing  (Luo  [Luo03],  Davidson,  Luo,  Wong,  and  Ma  [DLW00,  MDW+02]), 
and  other  areas  of  engineering.  The  book  edited  by  Wolkowicz,  Saigal,  and  Vandenberghe 
[WSV00] and the articles by Todd [Tod01], Lewis and Overton [LO96], and Vandenberghe
and Boyd [VB95] provide overviews and extensive bibliographies. Connections between
SDP and moment problems, of which we give a simple example on page 170, are explored
in detail by Bertsimas and Sethuraman [BS00], Nesterov [Nes00], and Lasserre [Las02].
The  fastest  mixing  Markov  chain  problem  is  from  Boyd,  Diaconis,  and  Xiao  [BDX04]. 

Multicriterion  optimization  and  Pareto  optimality  are  fundamental  tools  in  economics; 
see  Pareto  [Par71],  Debreu  [Deb59]  and  Luenberger  [Lue95].  The  result  in  example  4.9  is 
known as the Gauss-Markov theorem (Kailath, Sayed, and Hassibi [KSH00, page 97]).

Exercises


Basic  terminology  and  optimality  conditions 

4.1 Consider the optimization problem

    minimize    f_0(x_1, x_2)
    subject to  2x_1 + x_2 \ge 1
                x_1 + 3x_2 \ge 1
                x_1 \ge 0,  x_2 \ge 0.

Make a sketch of the feasible set. For each of the following objective functions, give the
optimal set and the optimal value.

(a) f_0(x_1, x_2) = x_1 + x_2.

(b) f_0(x_1, x_2) = -x_1 - x_2.

(c) f_0(x_1, x_2) = x_1.

(d) f_0(x_1, x_2) = \max\{x_1, x_2\}.

(e) f_0(x_1, x_2) = x_1^2 + 9x_2^2.

4.2 Consider the optimization problem

    minimize  f_0(x) = -\sum_{i=1}^m \log(b_i - a_i^T x)

with domain dom f_0 = \{x \mid Ax \prec b\}, where A \in R^{m \times n} (with rows a_i^T). We assume that
dom f_0 is nonempty.

Prove the following facts (which include the results quoted without proof on page 141).

(a) dom f_0 is unbounded if and only if there exists a v \ne 0 with Av \preceq 0.

(b) f_0 is unbounded below if and only if there exists a v with Av \preceq 0, Av \ne 0. Hint.
There exists a v such that Av \preceq 0, Av \ne 0 if and only if there exists no z \succ 0
such that A^T z = 0. This follows from the theorem of alternatives in example 2.21,
page 50.

(c) If f_0 is bounded below then its minimum is attained, i.e., there exists an x that
satisfies the optimality condition (4.23).

(d) The optimal set is affine: X_opt = \{x^\star + v \mid Av = 0\}, where x^\star is any optimal point.

4.3 Prove that x^\star = (1, 1/2, -1) is optimal for the optimization problem

    minimize    (1/2) x^T P x + q^T x + r
    subject to  -1 \le x_i \le 1,  i = 1, 2, 3,

where

    P = [ 13   12   -2 ]        q = [ -22.0 ]
        [ 12   17    6 ],           [ -14.5 ]
        [ -2    6   12 ]            [  13.0 ].

4.4 [P. Parrilo] Symmetries and convex optimization. Suppose $\mathcal{G} = \{Q_1, \ldots, Q_k\} \subseteq \mathbf{R}^{n \times n}$ is a
group, i.e., closed under products and inverse. We say that the function $f : \mathbf{R}^n \to \mathbf{R}$ is
$\mathcal{G}$-invariant, or symmetric with respect to $\mathcal{G}$, if $f(Q_i x) = f(x)$ holds for all $x$ and $i = 1, \ldots, k$.
We define $\bar{x} = (1/k) \sum_{i=1}^k Q_i x$, which is the average of $x$ over its $\mathcal{G}$-orbit. We define the
fixed subspace of $\mathcal{G}$ as

$$
\mathcal{F} = \{x \mid Q_i x = x, \; i = 1, \ldots, k\}.
$$

(a) Show that for any $x \in \mathbf{R}^n$, we have $\bar{x} \in \mathcal{F}$.

(b) Show that if $f : \mathbf{R}^n \to \mathbf{R}$ is convex and $\mathcal{G}$-invariant, then $f(\bar{x}) \leq f(x)$.

(c) We say the optimization problem

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m
\end{array}
$$

is $\mathcal{G}$-invariant if the objective $f_0$ is $\mathcal{G}$-invariant, and the feasible set is $\mathcal{G}$-invariant,
which means

$$
f_1(x) \leq 0, \; \ldots, \; f_m(x) \leq 0 \implies f_1(Q_i x) \leq 0, \; \ldots, \; f_m(Q_i x) \leq 0,
$$

for $i = 1, \ldots, k$. Show that if the problem is convex and $\mathcal{G}$-invariant, and there exists
an optimal point, then there exists an optimal point in $\mathcal{F}$. In other words, we can
adjoin the equality constraints $x \in \mathcal{F}$ to the problem, without loss of generality.

(d) As an example, suppose $f$ is convex and symmetric, i.e., $f(Px) = f(x)$ for every
permutation $P$. Show that if $f$ has a minimizer, then it has a minimizer of the form
$\alpha \mathbf{1}$. (This means to minimize $f$ over $x \in \mathbf{R}^n$, we can just as well minimize $f(t \mathbf{1})$
over $t \in \mathbf{R}$.)
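The orbit average in exercise 4.4 can be made concrete for the permutation group of part (d). The following is a minimal numerical sketch, not a proof: the test function `f` and the vector `x` are illustrative assumptions, and the helper name `orbit_average` is hypothetical.

```python
from itertools import permutations

def orbit_average(x):
    """Average of x over all permutations of its entries (its orbit under the permutation group)."""
    n = len(x)
    perms = list(permutations(range(n)))
    avg = [0.0] * n
    for p in perms:
        for i in range(n):
            avg[i] += x[p[i]]          # entry i of the permuted vector
    return [a / len(perms) for a in avg]

def f(x):
    # A convex, permutation-symmetric test function (an assumption made for illustration).
    return sum(xi ** 2 for xi in x) + max(x)

x = [3.0, -1.0, 4.0]
x_bar = orbit_average(x)               # every entry equals the mean of x
value_at_x = f(x)
value_at_x_bar = f(x_bar)              # should not exceed value_at_x, as in part (b)
```

For this group the average is a multiple of the all-ones vector, which is the form of minimizer that part (d) asks about.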

4.5 Equivalent convex problems. Show that the following three convex problems are equivalent.
Carefully explain how the solution of each problem is obtained from the solution of
the other problems. The problem data are the matrix $A \in \mathbf{R}^{m \times n}$ (with rows $a_i^T$), the
vector $b \in \mathbf{R}^m$, and the constant $M > 0$.

(a) The robust least-squares problem

$$
\mbox{minimize} \quad \sum_{i=1}^m \phi(a_i^T x - b_i),
$$

with variable $x \in \mathbf{R}^n$, where $\phi : \mathbf{R} \to \mathbf{R}$ is defined as

$$
\phi(u) = \left\{ \begin{array}{ll} u^2 & |u| \leq M \\ M(2|u| - M) & |u| > M. \end{array} \right.
$$

(This function is known as the Huber penalty function; see §6.1.2.)

(b) The least-squares problem with variable weights

$$
\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^m (a_i^T x - b_i)^2/(w_i + 1) + M^2 \mathbf{1}^T w \\
\mbox{subject to} & w \succeq 0,
\end{array}
$$

with variables $x \in \mathbf{R}^n$ and $w \in \mathbf{R}^m$, and domain $\mathcal{D} = \{(x, w) \in \mathbf{R}^n \times \mathbf{R}^m \mid w \succ -\mathbf{1}\}$.

Hint. Optimize over $w$ assuming $x$ is fixed, to establish a relation with the problem
in part (a).

(This problem can be interpreted as a weighted least-squares problem in which we
are allowed to adjust the weight of the $i$th residual. The weight is one if $w_i = 0$, and
decreases if we increase $w_i$. The second term in the objective penalizes large values
of $w$, i.e., large adjustments of the weights.)

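(c) The quadratic program

$$
\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^m (u_i^2 + 2M v_i) \\
\mbox{subject to} & -u - v \preceq Ax - b \preceq u + v \\
& 0 \preceq u \preceq M \mathbf{1} \\
& v \succeq 0.
\end{array}
$$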
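The hint in part (b) of exercise 4.5 can be explored numerically before attempting a proof. The sketch below compares the Huber penalty of part (a) with a crude grid minimization over a single weight variable; the grid bounds `w_max` and `steps` and the function names are assumptions chosen for illustration, and a grid search is of course not a derivation.

```python
def huber(u, M):
    """Huber penalty from part (a): quadratic for |u| <= M, linear growth beyond."""
    return u * u if abs(u) <= M else M * (2 * abs(u) - M)

def weighted_term(u, w, M):
    """One term of the objective in part (b), for residual u and weight variable w >= 0."""
    return u * u / (w + 1.0) + M * M * w

def min_over_w(u, M, w_max=10.0, steps=20000):
    """Crude grid minimization of weighted_term over 0 <= w <= w_max (illustrative only)."""
    return min(weighted_term(u, w_max * k / steps, M) for k in range(steps + 1))

M = 1.0
residuals = [0.3, 1.0, 2.5, -4.0]
# Pairs (grid minimum over w, Huber penalty) for each residual; they should nearly agree.
comparison = [(min_over_w(u, M), huber(u, M)) for u in residuals]
```

The grid must extend far enough to contain the minimizing weight, so `w_max` would need to grow with the largest residual relative to `M`.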
4.6 Handling convex equality constraints. A convex optimization problem can have only linear
equality constraint functions. In some special cases, however, it is possible to handle
convex equality constraint functions, i.e., constraints of the form $h(x) = 0$, where $h$ is
convex. We explore this idea in this problem.

Consider the optimization problem

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& h(x) = 0,
\end{array}
\qquad (4.65)
$$

where $f_i$ and $h$ are convex functions with domain $\mathbf{R}^n$. Unless $h$ is affine, this is not a
convex optimization problem. Consider the related problem

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& h(x) \leq 0,
\end{array}
\qquad (4.66)
$$

where the convex equality constraint has been relaxed to a convex inequality. This problem
is, of course, convex.

Now suppose we can guarantee that at any optimal solution $x^\star$ of the convex problem
(4.66), we have $h(x^\star) = 0$, i.e., the inequality $h(x) \leq 0$ is always active at the solution.
Then we can solve the (nonconvex) problem (4.65) by solving the convex problem (4.66).
Show that this is the case if there is an index $r$ such that

• $f_0$ is monotonically increasing in $x_r$

• $f_1, \ldots, f_m$ are nondecreasing in $x_r$

• $h$ is monotonically decreasing in $x_r$.

We will see specific examples in exercises 4.31 and 4.58.

4.7 Convex-concave fractional problems. Consider a problem of the form

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x)/(c^T x + d) \\
\mbox{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& Ax = b
\end{array}
$$

where $f_0, f_1, \ldots, f_m$ are convex, and the domain of the objective function is defined as
$\{x \in \mathbf{dom}\, f_0 \mid c^T x + d > 0\}$.

(a) Show that this is a quasiconvex optimization problem.

(b) Show that the problem is equivalent to

$$
\begin{array}{ll}
\mbox{minimize} & g_0(y, t) \\
\mbox{subject to} & g_i(y, t) \leq 0, \quad i = 1, \ldots, m \\
& Ay = bt \\
& c^T y + dt = 1,
\end{array}
$$

where $g_i$ is the perspective of $f_i$ (see §3.2.6). The variables are $y \in \mathbf{R}^n$ and $t \in \mathbf{R}$.
Show that this problem is convex.

(c) Following a similar argument, derive a convex formulation for the convex-concave
fractional problem

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x)/h(x) \\
\mbox{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& Ax = b
\end{array}
$$

where $f_0, f_1, \ldots, f_m$ are convex, $h$ is concave, the domain of the objective function
is defined as $\{x \in \mathbf{dom}\, f_0 \cap \mathbf{dom}\, h \mid h(x) > 0\}$ and $f_0(x) \geq 0$ everywhere.

As an example, apply your technique to the (unconstrained) problem with

$$
f_0(x) = (\mathbf{tr}\, F(x))/m, \qquad h(x) = (\det F(x))^{1/m},
$$

with $\mathbf{dom}(f_0/h) = \{x \mid F(x) \succ 0\}$, where $F(x) = F_0 + x_1 F_1 + \cdots + x_n F_n$ for given
$F_i \in \mathbf{S}^m$. In this problem, we minimize the ratio of the arithmetic mean over the
geometric mean of the eigenvalues of an affine matrix function $F(x)$.

Linear  optimization  problems 

4.8 Some simple LPs. Give an explicit solution of each of the following LPs.

(a) Minimizing a linear function over an affine set.

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Ax = b.
\end{array}
$$

(b) Minimizing a linear function over a halfspace.

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & a^T x \leq b,
\end{array}
$$

where $a \neq 0$.

(c) Minimizing a linear function over a rectangle.

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & l \preceq x \preceq u,
\end{array}
$$

where $l$ and $u$ satisfy $l \preceq u$.

(d) Minimizing a linear function over the probability simplex.

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & \mathbf{1}^T x = 1, \quad x \succeq 0.
\end{array}
$$

What happens if the equality constraint is replaced by an inequality $\mathbf{1}^T x \leq 1$?

We can interpret this LP as a simple portfolio optimization problem. The vector
$x$ represents the allocation of our total budget over different assets, with $x_i$ the
fraction invested in asset $i$. The return of each investment is fixed and given by $-c_i$,
so our total return (which we want to maximize) is $-c^T x$. If we replace the budget
constraint $\mathbf{1}^T x = 1$ with an inequality $\mathbf{1}^T x \leq 1$, we have the option of not investing
a portion of the total budget.

(e) Minimizing a linear function over a unit box with a total budget constraint.

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & \mathbf{1}^T x = \alpha, \quad 0 \preceq x \preceq \mathbf{1},
\end{array}
$$

where $\alpha$ is an integer between $0$ and $n$. What happens if $\alpha$ is not an integer (but
satisfies $0 \leq \alpha \leq n$)? What if we change the equality to an inequality $\mathbf{1}^T x \leq \alpha$?

(f) Minimizing a linear function over a unit box with a weighted budget constraint.

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & d^T x = \alpha, \quad 0 \preceq x \preceq \mathbf{1},
\end{array}
$$

with $d \succ 0$, and $0 \leq \alpha \leq \mathbf{1}^T d$.
4.9 Square LP. Consider the LP

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Ax \preceq b
\end{array}
$$

with $A$ square and nonsingular. Show that the optimal value is given by

$$
p^\star = \left\{ \begin{array}{ll} c^T A^{-1} b & A^{-T} c \preceq 0 \\ -\infty & \mbox{otherwise.} \end{array} \right.
$$


4.10  Converting  general  LP  to  standard  form.  Work  out  the  details  on  page  147  of  §4.3. 
Explain  in  detail  the  relation  between  the  feasible  sets,  the  optimal  solutions,  and  the 
optimal  values  of  the  standard  form  LP  and  the  original  LP. 

4.11 Problems involving $\ell_1$- and $\ell_\infty$-norms. Formulate the following problems as LPs. Explain
in detail the relation between the optimal solution of each problem and the solution of its
equivalent LP.

(a) Minimize $\|Ax - b\|_\infty$ ($\ell_\infty$-norm approximation).

(b) Minimize $\|Ax - b\|_1$ ($\ell_1$-norm approximation).

(c) Minimize $\|Ax - b\|_1$ subject to $\|x\|_\infty \leq 1$.

(d) Minimize $\|x\|_1$ subject to $\|Ax - b\|_\infty \leq 1$.

(e) Minimize $\|Ax - b\|_1 + \|x\|_\infty$.

In each problem, $A \in \mathbf{R}^{m \times n}$ and $b \in \mathbf{R}^m$ are given. (See §6.1 for more problems involving
approximation and constrained approximation.)

4.12 Network flow problem. Consider a network of $n$ nodes, with directed links connecting each
pair of nodes. The variables in the problem are the flows on each link: $x_{ij}$ will denote the
flow from node $i$ to node $j$. The cost of the flow along the link from node $i$ to node $j$ is
given by $c_{ij} x_{ij}$, where $c_{ij}$ are given constants. The total cost across the network is

$$
C = \sum_{i,j=1}^n c_{ij} x_{ij}.
$$

Each link flow $x_{ij}$ is also subject to a given lower bound $l_{ij}$ (usually assumed to be
nonnegative) and an upper bound $u_{ij}$.

The external supply at node $i$ is given by $b_i$, where $b_i > 0$ means an external flow enters
the network at node $i$, and $b_i < 0$ means that at node $i$, an amount $|b_i|$ flows out of the
network. We assume that $\mathbf{1}^T b = 0$, i.e., the total external supply equals total external
demand. At each node we have conservation of flow: the total flow into node $i$ along links
and the external supply, minus the total flow out along the links, equals zero.

The problem is to minimize the total cost of flow through the network, subject to the
constraints described above. Formulate this problem as an LP.
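The quantities described in exercise 4.12 can be evaluated directly for a candidate flow, which may help when checking a formulation. This is a minimal sketch under assumed data: the three-node network, its costs, and the helper names are hypothetical, and flows are stored as a nested list with `x[i][j]` the flow from node `i` to node `j`.

```python
def total_cost(c, x):
    """Total cost C, the sum of c[i][j] * x[i][j] over all links."""
    n = len(x)
    return sum(c[i][j] * x[i][j] for i in range(n) for j in range(n))

def conservation_residual(x, b):
    """External supply plus inflow minus outflow at each node; all zeros when flow is conserved."""
    n = len(x)
    return [b[i] + sum(x[j][i] for j in range(n)) - sum(x[i][j] for j in range(n))
            for i in range(n)]

# Hypothetical data: node 0 supplies 2 units, node 2 demands 2 units, routed 0 -> 1 -> 2.
b = [2.0, 0.0, -2.0]
c = [[0.0, 1.0, 5.0],
     [0.0, 0.0, 3.0],
     [0.0, 0.0, 0.0]]
x = [[0.0, 2.0, 0.0],
     [0.0, 0.0, 2.0],
     [0.0, 0.0, 0.0]]

cost = total_cost(c, x)
residual = conservation_residual(x, b)
```

Both functions are linear in the flows, which is the property an LP formulation relies on.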

4.13 Robust LP with interval coefficients. Consider the problem, with variable $x \in \mathbf{R}^n$,

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Ax \preceq b \mbox{ for all } A \in \mathcal{A},
\end{array}
$$

where $\mathcal{A} \subseteq \mathbf{R}^{m \times n}$ is the set

$$
\mathcal{A} = \{A \in \mathbf{R}^{m \times n} \mid \bar{A}_{ij} - V_{ij} \leq A_{ij} \leq \bar{A}_{ij} + V_{ij}, \; i = 1, \ldots, m, \; j = 1, \ldots, n\}.
$$

(The matrices $\bar{A}$ and $V$ are given.) This problem can be interpreted as an LP where each
coefficient of $A$ is only known to lie in an interval, and we require that $x$ must satisfy the
constraints for all possible values of the coefficients.

Express this problem as an LP. The LP you construct should be efficient, i.e., it should
not have dimensions that grow exponentially with $n$ or $m$.




4.14 Approximating a matrix in infinity norm. The $\ell_\infty$-norm induced norm of a matrix $A \in
\mathbf{R}^{m \times n}$, denoted $\|A\|_\infty$, is given by

$$
\|A\|_\infty = \sup_{x \neq 0} \frac{\|Ax\|_\infty}{\|x\|_\infty} = \max_{i=1,\ldots,m} \sum_{j=1}^n |a_{ij}|.
$$

This norm is sometimes called the max-row-sum norm, for obvious reasons (see §A.1.5).
Consider the problem of approximating a matrix, in the max-row-sum norm, by a linear
combination of other matrices. That is, we are given $k + 1$ matrices $A_0, \ldots, A_k \in \mathbf{R}^{m \times n}$,
and need to find $x \in \mathbf{R}^k$ that minimizes

$$
\|A_0 + x_1 A_1 + \cdots + x_k A_k\|_\infty.
$$

Express this problem as a linear program. Explain the significance of any extra variables
in your LP. Carefully explain how your LP formulation solves this problem, e.g., what is
the relation between the feasible set for your LP and this problem?
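The two expressions for the induced norm in exercise 4.14 can be compared on a small matrix. The sketch below assumes that the supremum can be evaluated by trying every vector with entries $\pm 1$ (the maximum of a convex function over the unit box is attained at a vertex); the matrix `A` and the helper names are illustrative assumptions.

```python
from itertools import product

def max_row_sum(A):
    """Largest sum of absolute values along a row of A."""
    return max(sum(abs(a) for a in row) for row in A)

def inf_norm(v):
    return max(abs(t) for t in v)

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def sup_over_sign_vectors(A):
    """Largest value of ||A s||_inf over all sign vectors s, each of which has ||s||_inf = 1."""
    n = len(A[0])
    return max(inf_norm(matvec(A, s)) for s in product((-1.0, 1.0), repeat=n))

A = [[1.0, -2.0, 3.0],
     [-4.0, 0.5, 0.0]]
row_sum_norm = max_row_sum(A)
induced_norm = sup_over_sign_vectors(A)   # enumerates 2^n sign vectors, so only for small n
```

The enumeration is exponential in the number of columns, whereas the max-row-sum expression is cheap, which is one reason the latter is the useful form for an LP formulation.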

4.15 Relaxation of Boolean LP. In a Boolean linear program, the variable $x$ is constrained to
have components equal to zero or one:

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Ax \preceq b \\
& x_i \in \{0, 1\}, \quad i = 1, \ldots, n.
\end{array}
\qquad (4.67)
$$

In general, such problems are very difficult to solve, even though the feasible set is finite
(containing at most $2^n$ points).

In a general method called relaxation, the constraint that $x_i$ be zero or one is replaced
with the linear inequalities $0 \leq x_i \leq 1$:

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Ax \preceq b \\
& 0 \leq x_i \leq 1, \quad i = 1, \ldots, n.
\end{array}
\qquad (4.68)
$$

We refer to this problem as the LP relaxation of the Boolean LP (4.67). The LP relaxation
is far easier to solve than the original Boolean LP.

(a) Show that the optimal value of the LP relaxation (4.68) is a lower bound on the
optimal value of the Boolean LP (4.67). What can you say about the Boolean LP
if the LP relaxation is infeasible?

(b) It sometimes happens that the LP relaxation has a solution with $x_i \in \{0, 1\}$. What
can you say in this case?
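The remark in exercise 4.15 that the Boolean feasible set contains at most $2^n$ points suggests a brute-force method, sketched below for a tiny instance. The data `c`, `A`, `b` and the function name are hypothetical; the loop visits every Boolean vector, so its cost doubles with each added variable, which illustrates why the LP relaxation (4.68) is attractive.

```python
from itertools import product

def solve_boolean_lp(c, A, b):
    """Exhaustive search over {0,1}^n for problem (4.67); returns (optimal value, minimizer).

    Returns (None, None) if no Boolean point satisfies Ax <= b.
    """
    best_val, best_x = None, None
    for x in product((0, 1), repeat=len(c)):
        feasible = all(sum(aij * xj for aij, xj in zip(row, x)) <= bi
                       for row, bi in zip(A, b))
        if feasible:
            val = sum(ci * xi for ci, xi in zip(c, x))
            if best_val is None or val < best_val:
                best_val, best_x = val, x
    return best_val, best_x

# Hypothetical instance with three variables and one inequality.
c = [-3, -2, -4]
A = [[2, 1, 3]]
b = [4]
opt_val, opt_x = solve_boolean_lp(c, A, b)
```

Any lower bound obtained from a relaxation could be compared against `opt_val` on instances small enough to enumerate.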

4.16 Minimum fuel optimal control. We consider a linear dynamical system with state $x(t) \in
\mathbf{R}^n$, $t = 0, \ldots, N$, and actuator or input signal $u(t) \in \mathbf{R}$, for $t = 0, \ldots, N - 1$. The
dynamics of the system is given by the linear recurrence

$$
x(t + 1) = Ax(t) + bu(t), \quad t = 0, \ldots, N - 1,
$$

where $A \in \mathbf{R}^{n \times n}$ and $b \in \mathbf{R}^n$ are given. We assume that the initial state is zero, i.e.,
$x(0) = 0$.

The minimum fuel optimal control problem is to choose the inputs $u(0), \ldots, u(N - 1)$ so
as to minimize the total fuel consumed, which is given by

$$
F = \sum_{t=0}^{N-1} f(u(t)),
$$

subject to the constraint that $x(N) = x_{\rm des}$, where $N$ is the (given) time horizon, and
$x_{\rm des} \in \mathbf{R}^n$ is the (given) desired final or target state. The function $f : \mathbf{R} \to \mathbf{R}$ is the fuel
use map for the actuator, and gives the amount of fuel used as a function of the actuator
signal amplitude. In this problem we use

$$
f(a) = \left\{ \begin{array}{ll} |a| & |a| \leq 1 \\ 2|a| - 1 & |a| > 1. \end{array} \right.
$$

This means that fuel use is proportional to the absolute value of the actuator signal, for
actuator signals between $-1$ and $1$; for larger actuator signals the marginal fuel efficiency
is half.

Formulate the minimum fuel optimal control problem as an LP.
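To experiment with exercise 4.16, it can help to simulate the recurrence and evaluate the fuel use for a given input sequence. The following is a minimal sketch; the system matrices, the input sequence `u`, and the function names are hypothetical values chosen for illustration.

```python
def fuel(a):
    """Fuel use map f from the exercise."""
    return abs(a) if abs(a) <= 1 else 2 * abs(a) - 1

def total_fuel(u):
    """Total fuel F for an input sequence u(0), ..., u(N-1)."""
    return sum(fuel(ut) for ut in u)

def simulate(A, b, u):
    """Final state x(N) of x(t+1) = A x(t) + b u(t), starting from x(0) = 0."""
    n = len(b)
    x = [0.0] * n
    for ut in u:
        x = [sum(A[i][j] * x[j] for j in range(n)) + b[i] * ut for i in range(n)]
    return x

# Hypothetical double-integrator style system with horizon N = 3.
A = [[1.0, 1.0],
     [0.0, 1.0]]
b = [0.0, 1.0]
u = [1.5, -0.5, 0.0]

x_final = simulate(A, b, u)
F = total_fuel(u)
```

Since $x(N)$ depends linearly on the inputs, the target constraint $x(N) = x_{\rm des}$ is a set of linear equalities in $u(0), \ldots, u(N-1)$.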

4.17 Optimal activity levels. We consider the selection of $n$ nonnegative activity levels, denoted
$x_1, \ldots, x_n$. These activities consume $m$ resources, which are limited. Activity $j$ consumes
$A_{ij} x_j$ of resource $i$, where $A_{ij}$ are given. The total resource consumption is additive, so
the total of resource $i$ consumed is $c_i = \sum_{j=1}^n A_{ij} x_j$. (Ordinarily we have $A_{ij} \geq 0$, i.e.,
activity $j$ consumes resource $i$. But we allow the possibility that $A_{ij} < 0$, which means
that activity $j$ actually generates resource $i$ as a by-product.) Each resource consumption
is limited: we must have $c_i \leq c_i^{\max}$, where $c_i^{\max}$ are given. Each activity generates revenue,
which is a piecewise-linear concave function of the activity level:

$$
r_j(x_j) = \left\{ \begin{array}{ll} p_j x_j & 0 \leq x_j \leq q_j \\ p_j q_j + p_j^{\rm disc} (x_j - q_j) & x_j \geq q_j. \end{array} \right.
$$

Here $p_j > 0$ is the basic price, $q_j > 0$ is the quantity discount level, and $p_j^{\rm disc}$ is the
quantity discount price, for (the product of) activity $j$. (We have $0 < p_j^{\rm disc} < p_j$.) The
total revenue is the sum of the revenues associated with each activity, i.e., $\sum_{j=1}^n r_j(x_j)$.
The goal is to choose activity levels that maximize the total revenue while respecting the
resource limits. Show how to formulate this problem as an LP.
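The revenue function and resource consumption in exercise 4.17 are easy to evaluate for trial activity levels. The sketch below uses hypothetical prices and a hypothetical consumption matrix; it only evaluates the quantities defined in the exercise and does not solve the optimization problem.

```python
def revenue(x, p, q, p_disc):
    """Revenue r_j for one activity: basic price p up to level q, discount price p_disc beyond."""
    return p * x if x <= q else p * q + p_disc * (x - q)

def resource_use(A, x):
    """Total consumption c_i of each resource for activity levels x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

# Hypothetical data: basic price 3, discount level 10, discount price 2.
p, q, p_disc = 3.0, 10.0, 2.0
revenues = [revenue(x, p, q, p_disc) for x in (5.0, 10.0, 14.0)]

# Hypothetical two-resource, two-activity consumption matrix; the negative entry
# means the second activity generates the second resource as a by-product.
A = [[1.0, 2.0],
     [0.5, -1.0]]
consumption = resource_use(A, [4.0, 3.0])
```

Plotting `revenue` against the activity level shows the kink at the discount level, where the slope drops from the basic price to the discount price.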

4.18 Separating hyperplanes and spheres. Suppose you are given two sets of points in $\mathbf{R}^n$,
$\{v^1, v^2, \ldots, v^K\}$ and $\{w^1, w^2, \ldots, w^L\}$. Formulate the following two problems as LP
feasibility problems.

(a) Determine a hyperplane that separates the two sets, i.e., find $a \in \mathbf{R}^n$ and $b \in \mathbf{R}$
with $a \neq 0$ such that

$$
a^T v^i \leq b, \quad i = 1, \ldots, K, \qquad a^T w^i \geq b, \quad i = 1, \ldots, L.
$$

Note that we require $a \neq 0$, so you have to make sure that your formulation excludes
the trivial solution $a = 0$, $b = 0$. You can assume that

$$
\mathbf{rank} \left[ \begin{array}{cccccccc}
v^1 & v^2 & \cdots & v^K & w^1 & w^2 & \cdots & w^L \\
1 & 1 & \cdots & 1 & 1 & 1 & \cdots & 1
\end{array} \right] = n + 1
$$

(i.e., the affine hull of the $K + L$ points has dimension $n$).

(b) Determine a sphere separating the two sets of points, i.e., find $x_c \in \mathbf{R}^n$ and $R \geq 0$
such that

$$
\|v^i - x_c\|_2 \leq R, \quad i = 1, \ldots, K, \qquad \|w^i - x_c\|_2 \geq R, \quad i = 1, \ldots, L.
$$

(Here $x_c$ is the center of the sphere; $R$ is its radius.)

(See chapter 8 for more on separating hyperplanes, separating spheres, and related topics.)
4.19 Consider the problem

$$
\begin{array}{ll}
\mbox{minimize} & \|Ax - b\|_1/(c^T x + d) \\
\mbox{subject to} & \|x\|_\infty \leq 1,
\end{array}
$$

where $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$, $c \in \mathbf{R}^n$, and $d \in \mathbf{R}$. We assume that $d > \|c\|_1$, which implies
that $c^T x + d > 0$ for all feasible $x$.

(a) Show that this is a quasiconvex optimization problem.

(b) Show that it is equivalent to the convex optimization problem

$$
\begin{array}{ll}
\mbox{minimize} & \|Ay - bt\|_1 \\
\mbox{subject to} & \|y\|_\infty \leq t \\
& c^T y + dt = 1,
\end{array}
$$

with variables $y \in \mathbf{R}^n$, $t \in \mathbf{R}$.

4.20 Power assignment in a wireless communication system. We consider $n$ transmitters with
powers $p_1, \ldots, p_n \geq 0$, transmitting to $n$ receivers. These powers are the optimization
variables in the problem. We let $G \in \mathbf{R}^{n \times n}$ denote the matrix of path gains from the
transmitters to the receivers; $G_{ij} \geq 0$ is the path gain from transmitter $j$ to receiver $i$.
The signal power at receiver $i$ is then $S_i = G_{ii} p_i$, and the interference power at receiver $i$
is $I_i = \sum_{k \neq i} G_{ik} p_k$. The signal to interference plus noise ratio, denoted SINR, at receiver
$i$, is given by $S_i/(I_i + \sigma_i)$, where $\sigma_i > 0$ is the (self-) noise power in receiver $i$. The
objective in the problem is to maximize the minimum SINR ratio, over all receivers, i.e.,
to maximize

$$
\min_{i=1,\ldots,n} \frac{S_i}{I_i + \sigma_i}.
$$

There are a number of constraints on the powers that must be satisfied, in addition to the
obvious one $p_i \geq 0$. The first is a maximum allowable power for each transmitter, i.e.,
$p_i \leq P_i^{\max}$, where $P_i^{\max} > 0$ is given. In addition, the transmitters are partitioned into
groups, with each group sharing the same power supply, so there is a total power constraint
for each group of transmitter powers. More precisely, we have subsets $K_1, \ldots, K_m$ of
$\{1, \ldots, n\}$ with $K_1 \cup \cdots \cup K_m = \{1, \ldots, n\}$, and $K_j \cap K_l = \emptyset$ if $j \neq l$. For each group $K_l$,
the total associated transmitter power cannot exceed $P_l^{\rm gp} > 0$:

$$
\sum_{k \in K_l} p_k \leq P_l^{\rm gp}, \quad l = 1, \ldots, m.
$$

Finally, we have a limit $P_i^{\rm rc} > 0$ on the total received power at each receiver:

$$
\sum_{k=1}^n G_{ik} p_k \leq P_i^{\rm rc}, \quad i = 1, \ldots, n.
$$

(This constraint reflects the fact that the receivers will saturate if the total received power
is too large.)

Formulate the SINR maximization problem as a generalized linear-fractional program.
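The objective and the received-power constraint in exercise 4.20 can be evaluated for a trial power assignment as follows. This is a minimal sketch with a hypothetical two-transmitter gain matrix, noise powers, and function names; it evaluates the quantities only and does not perform the maximization.

```python
def min_sinr(G, p, sigma):
    """Smallest signal to interference plus noise ratio over all receivers."""
    n = len(p)
    ratios = []
    for i in range(n):
        signal = G[i][i] * p[i]
        interference = sum(G[i][k] * p[k] for k in range(n) if k != i)
        ratios.append(signal / (interference + sigma[i]))
    return min(ratios)

def received_power(G, p):
    """Total received power at each receiver (signal plus interference)."""
    n = len(p)
    return [sum(G[i][k] * p[k] for k in range(n)) for i in range(n)]

# Hypothetical data for two transmitter/receiver pairs.
G = [[1.0, 0.1],
     [0.2, 1.0]]
p = [1.0, 2.0]
sigma = [0.1, 0.1]

worst_sinr = min_sinr(G, p, sigma)
total_received = received_power(G, p)
```

Each SINR is a ratio of a linear function of the powers to an affine function of the powers with a positive denominator, which is the structure a generalized linear-fractional program requires.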

Quadratic  optimization  problems 

4.21 Some simple QCQPs. Give an explicit solution of each of the following QCQPs.

(a) Minimizing a linear function over an ellipsoid centered at the origin.

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & x^T A x \leq 1,
\end{array}
$$

where $A \in \mathbf{S}^n_{++}$ and $c \neq 0$. What is the solution if the problem is not convex
($A \notin \mathbf{S}^n_+$)?

(b) Minimizing a linear function over an ellipsoid.

$$
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & (x - x_c)^T A (x - x_c) \leq 1,
\end{array}
$$

where $A \in \mathbf{S}^n_{++}$ and $c \neq 0$.

(c) Minimizing a quadratic form over an ellipsoid centered at the origin.

$$
\begin{array}{ll}
\mbox{minimize} & x^T B x \\
\mbox{subject to} & x^T A x \leq 1,
\end{array}
$$

where $A \in \mathbf{S}^n_{++}$ and $B \in \mathbf{S}^n_+$. Also consider the nonconvex extension with $B \notin \mathbf{S}^n_+$.
(See §B.1.)

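4.22 Consider the QCQP

$$
\begin{array}{ll}
\mbox{minimize} & (1/2) x^T P x + q^T x + r \\
\mbox{subject to} & x^T x \leq 1,
\end{array}
$$

with $P \in \mathbf{S}^n_{++}$. Show that $x^\star = -(P + \lambda I)^{-1} q$ where $\lambda = \max\{0, \bar{\lambda}\}$ and $\bar{\lambda}$ is the largest
solution of the nonlinear equation

$$
q^T (P + \lambda I)^{-2} q = 1.
$$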


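4.23 $\ell_4$-norm approximation via QCQP. Formulate the $\ell_4$-norm approximation problem

$$
\mbox{minimize} \quad \|Ax - b\|_4 = \left( \sum_{i=1}^m (a_i^T x - b_i)^4 \right)^{1/4}
$$

as a QCQP. The matrix $A \in \mathbf{R}^{m \times n}$ (with rows $a_i^T$) and the vector $b \in \mathbf{R}^m$ are given.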

4.24 Complex $\ell_1$-, $\ell_2$- and $\ell_\infty$-norm approximation. Consider the problem

$$
\mbox{minimize} \quad \|Ax - b\|_p,
$$

where $A \in \mathbf{C}^{m \times n}$, $b \in \mathbf{C}^m$, and the variable is $x \in \mathbf{C}^n$. The complex $\ell_p$-norm is defined
by

$$
\|y\|_p = \left( \sum_{i=1}^m |y_i|^p \right)^{1/p}
$$

for $p \geq 1$, and $\|y\|_\infty = \max_{i=1,\ldots,m} |y_i|$. For $p = 1$, $2$, and $\infty$, express the complex $\ell_p$-norm
approximation problem as a QCQP or SOCP with real variables and data.

4.25 Linear separation of two sets of ellipsoids. Suppose we are given $K + L$ ellipsoids

$$
\mathcal{E}_i = \{P_i u + q_i \mid \|u\|_2 \leq 1\}, \quad i = 1, \ldots, K + L,
$$

where $P_i \in \mathbf{S}^n$. We are interested in finding a hyperplane that strictly separates $\mathcal{E}_1, \ldots,
\mathcal{E}_K$ from $\mathcal{E}_{K+1}, \ldots, \mathcal{E}_{K+L}$, i.e., we want to compute $a \in \mathbf{R}^n$, $b \in \mathbf{R}$ such that

$$
a^T x + b > 0 \mbox{ for } x \in \mathcal{E}_1 \cup \cdots \cup \mathcal{E}_K, \qquad
a^T x + b < 0 \mbox{ for } x \in \mathcal{E}_{K+1} \cup \cdots \cup \mathcal{E}_{K+L},
$$

or prove that no such hyperplane exists. Express this problem as an SOCP feasibility
problem.

4.26 Hyperbolic constraints as SOC constraints. Verify that $x \in \mathbf{R}^n$, $y, z \in \mathbf{R}$ satisfy

$$
x^T x \leq yz, \quad y \geq 0, \quad z \geq 0
$$

if and only if

$$
\left\| \left[ \begin{array}{c} 2x \\ y - z \end{array} \right] \right\|_2 \leq y + z, \quad y \geq 0, \quad z \geq 0.
$$

Use this observation to cast the following problems as SOCPs.

(a) Maximizing harmonic mean.

$$
\mbox{maximize} \quad \left( \sum_{i=1}^m 1/(a_i^T x - b_i) \right)^{-1},
$$

with domain $\{x \mid Ax \succ b\}$, where $a_i^T$ is the $i$th row of $A$.

(b) Maximizing geometric mean.

$$
\mbox{maximize} \quad \left( \prod_{i=1}^m (a_i^T x - b_i) \right)^{1/m},
$$

with domain $\{x \mid Ax \succeq b\}$, where $a_i^T$ is the $i$th row of $A$.
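The equivalence stated at the start of exercise 4.26 can be spot-checked numerically before it is verified algebraically. The sketch below tests both conditions on a few hand-picked points; the sample points and function names are assumptions for illustration, and agreement on samples is not a proof.

```python
import math

def hyperbolic(x, y, z):
    """The hyperbolic constraint: x^T x <= y z with y >= 0 and z >= 0."""
    return y >= 0 and z >= 0 and sum(t * t for t in x) <= y * z

def second_order_cone(x, y, z):
    """The SOC form: ||(2x, y - z)||_2 <= y + z with y >= 0 and z >= 0."""
    norm = math.sqrt(sum((2 * t) ** 2 for t in x) + (y - z) ** 2)
    return y >= 0 and z >= 0 and norm <= y + z

# Hand-picked samples (x, y, z): strictly feasible, infeasible, on the boundary, and with y < 0.
samples = [([1.0, 0.0], 2.0, 1.0),
           ([1.0, 1.0], 1.0, 1.0),
           ([0.5], 0.25, 1.0),
           ([0.0], -1.0, 2.0)]
agree = all(hyperbolic(x, y, z) == second_order_cone(x, y, z) for x, y, z in samples)
```

In floating-point arithmetic, points that lie exactly on the boundary may be classified differently by the two tests unless the arithmetic happens to be exact, as it is for the boundary sample above.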

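4.27 Matrix fractional minimization via SOCP. Express the following problem as an SOCP:

$$
\begin{array}{ll}
\mbox{minimize} & (Ax + b)^T (I + B \mathop{\bf diag}(x) B^T)^{-1} (Ax + b) \\
\mbox{subject to} & x \succeq 0,
\end{array}
$$

with $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$, $B \in \mathbf{R}^{m \times n}$. The variable is $x \in \mathbf{R}^n$.

Hint. First show that the problem is equivalent to

$$
\begin{array}{ll}
\mbox{minimize} & v^T v + w^T \mathop{\bf diag}(x)^{-1} w \\
\mbox{subject to} & v + Bw = Ax + b \\
& x \succeq 0,
\end{array}
$$

with variables $v \in \mathbf{R}^m$, $w, x \in \mathbf{R}^n$. (If $x_i = 0$ we interpret $w_i^2/x_i$ as zero if $w_i = 0$ and as
$\infty$ otherwise.) Then use the results of exercise 4.26.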

4.28 Robust quadratic programming. In §4.4.2 we discussed robust linear programming as an
application of second-order cone programming. In this problem we consider a similar
robust variation of the (convex) quadratic program

$$
\begin{array}{ll}
\mbox{minimize} & (1/2) x^T P x + q^T x + r \\
\mbox{subject to} & Ax \preceq b.
\end{array}
$$

For simplicity we assume that only the matrix $P$ is subject to errors, and the other
parameters ($q$, $r$, $A$, $b$) are exactly known. The robust quadratic program is defined as

$$
\begin{array}{ll}
\mbox{minimize} & \sup_{P \in \mathcal{E}} ((1/2) x^T P x + q^T x + r) \\
\mbox{subject to} & Ax \preceq b
\end{array}
$$

where $\mathcal{E}$ is the set of possible matrices $P$.

For each of the following sets $\mathcal{E}$, express the robust QP as a convex problem. Be as specific
as you can. If the problem can be expressed in a standard form (e.g., QP, QCQP, SOCP,
SDP), say so.

(a) A finite set of matrices: $\mathcal{E} = \{P_1, \ldots, P_K\}$, where $P_i \in \mathbf{S}^n_+$, $i = 1, \ldots, K$.

(b) A set specified by a nominal value $P_0 \in \mathbf{S}^n_+$ plus a bound on the eigenvalues of the
deviation $P - P_0$:

$$
\mathcal{E} = \{P \in \mathbf{S}^n \mid -\gamma I \preceq P - P_0 \preceq \gamma I\}
$$

where $\gamma \in \mathbf{R}$ and $P_0 \in \mathbf{S}^n_+$.

(c) An ellipsoid of matrices:

$$
\mathcal{E} = \left\{ P_0 + \sum_{i=1}^K P_i u_i \;\middle|\; \|u\|_2 \leq 1 \right\}.
$$

You can assume $P_i \in \mathbf{S}^n_+$, $i = 0, \ldots, K$.
4.29 Maximizing probability of satisfying a linear inequality. Let $c$ be a random variable in $\mathbf{R}^n$,
normally distributed with mean $\bar{c}$ and covariance matrix $R$. Consider the problem

$$
\begin{array}{ll}
\mbox{maximize} & \mathbf{prob}(c^T x \geq \alpha) \\
\mbox{subject to} & Fx \preceq g, \quad Ax = b.
\end{array}
$$

Assuming there exists a feasible point $\tilde{x}$ for which $\bar{c}^T \tilde{x} \geq \alpha$, show that this problem is
equivalent to a convex or quasiconvex optimization problem. Formulate the problem as a
QP, QCQP, or SOCP (if the problem is convex), or explain how you can solve it by solving
a sequence of QP, QCQP, or SOCP feasibility problems (if the problem is quasiconvex).

Geometric  programming 

4.30 A heated fluid at temperature $T$ (degrees above ambient temperature) flows in a pipe
with fixed length and circular cross section with radius $r$. A layer of insulation, with
thickness $w \ll r$, surrounds the pipe to reduce heat loss through the pipe walls. The
design variables in this problem are $T$, $r$, and $w$.

The heat loss is (approximately) proportional to $Tr/w$, so over a fixed lifetime, the energy
cost due to heat loss is given by $\alpha_1 Tr/w$. The cost of the pipe, which has a fixed wall
thickness, is approximately proportional to the total material, i.e., it is given by $\alpha_2 r$. The
cost of the insulation is also approximately proportional to the total insulation material,
i.e., $\alpha_3 rw$ (using $w \ll r$). The total cost is the sum of these three costs.

The heat flow down the pipe is entirely due to the flow of the fluid, which has a fixed
velocity, i.e., it is given by $\alpha_4 Tr^2$. The constants $\alpha_i$ are all positive, as are the variables
$T$, $r$, and $w$.

Now the problem: maximize the total heat flow down the pipe, subject to an upper limit
$C_{\max}$ on total cost, and the constraints

$$
T_{\min} \leq T \leq T_{\max}, \qquad r_{\min} \leq r \leq r_{\max}, \qquad w_{\min} \leq w \leq w_{\max}, \qquad w \leq 0.1 r.
$$

Express this problem as a geometric program.

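4.31 Recursive formulation of optimal beam design problem. Show that the GP (4.46) is equivalent
to the GP

$$
\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^N w_i h_i \\
\mbox{subject to} & w_i/w_{\max} \leq 1, \quad w_{\min}/w_i \leq 1, \quad i = 1, \ldots, N \\
& h_i/h_{\max} \leq 1, \quad h_{\min}/h_i \leq 1, \quad i = 1, \ldots, N \\
& h_i/(w_i S_{\max}) \leq 1, \quad S_{\min} w_i/h_i \leq 1, \quad i = 1, \ldots, N \\
& 6iF/(\sigma_{\max} w_i h_i^2) \leq 1, \quad i = 1, \ldots, N \\
& (2i - 1) d_i/v_i + v_{i+1}/v_i \leq 1, \quad i = 1, \ldots, N \\
& (i - 1/3) d_i/y_i + v_{i+1}/y_i + y_{i+1}/y_i \leq 1, \quad i = 1, \ldots, N \\
& y_1/y_{\max} \leq 1 \\
& E w_i h_i^3 d_i/(6F) = 1, \quad i = 1, \ldots, N.
\end{array}
$$

The variables are $w_i$, $h_i$, $v_i$, $d_i$, $y_i$ for $i = 1, \ldots, N$.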

4.32 Approximating a function as a monomial. Suppose the function $f : \mathbf{R}^n \to \mathbf{R}$ is differentiable
at a point $x_0 \succ 0$, with $f(x_0) > 0$. How would you find a monomial function
$\hat{f} : \mathbf{R}^n \to \mathbf{R}$ such that $\hat{f}(x_0) = f(x_0)$ and for $x$ near $x_0$, $\hat{f}(x)$ is very near $f(x)$?

4.33  Express  the  following  problems  as  convex  optimization  problems. 

(a)  Minimize  max{p(x),  q(x)},  where  p  and  q  are  posynomials. 

(b)  Minimize  exp(p(x))  +  exp(q(x)),  where  p  and  q  are  posynomials. 

(c)  Minimize  p(x)/(r(x)  —  q(x)),  subject  to  r(x)  >  q(x),  where  p,q  are  posynomials, 
and  r  is  a  monomial. 
4.34 Log-convexity of Perron-Frobenius eigenvalue. Let $A \in \mathbf{R}^{n \times n}$ be an elementwise positive
matrix, i.e., $A_{ij} > 0$. (The results of this problem hold for irreducible nonnegative
matrices as well.) Let $\lambda_{\rm pf}(A)$ denote its Perron-Frobenius eigenvalue, i.e., its eigenvalue
of largest magnitude. (See the definition and the example on page 165.) Show that
$\log \lambda_{\rm pf}(A)$ is a convex function of $\log A_{ij}$. This means, for example, that we have the
inequality

$$
\lambda_{\rm pf}(C) \leq (\lambda_{\rm pf}(A) \lambda_{\rm pf}(B))^{1/2},
$$

where $C_{ij} = (A_{ij} B_{ij})^{1/2}$, and $A$ and $B$ are elementwise positive matrices.

Hint. Use the characterization of the Perron-Frobenius eigenvalue given in (4.47), or,
alternatively, use the characterization

$$
\log \lambda_{\rm pf}(A) = \lim_{k \to \infty} (1/k) \log(\mathbf{1}^T A^k \mathbf{1}).
$$
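The inequality in exercise 4.34 can be checked on small examples. The sketch below estimates the Perron-Frobenius eigenvalue by power iteration, a standard numerical technique substituted here for the characterizations in the hint; the matrices `A` and `B`, the iteration count, and the function name are illustrative assumptions.

```python
import math

def perron_eig(A, iters=200):
    """Power-iteration estimate of the Perron-Frobenius eigenvalue of an elementwise positive matrix."""
    n = len(A)
    v = [1.0 / n] * n                      # positive start vector with entries summing to one
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(w)                       # equals 1^T A v, since 1^T v = 1
        v = [wi / lam for wi in w]         # renormalize so that 1^T v = 1
    return lam

# Hypothetical elementwise positive matrices.
A = [[2.0, 1.0],
     [1.0, 3.0]]
B = [[1.0, 4.0],
     [2.0, 1.0]]
# Elementwise geometric mean of A and B.
C = [[math.sqrt(A[i][j] * B[i][j]) for j in range(2)] for i in range(2)]

lhs = perron_eig(C)
rhs = math.sqrt(perron_eig(A) * perron_eig(B))   # the exercise claims lhs <= rhs
```

The fixed iteration count is adequate for these small matrices; a more careful implementation would stop on a convergence tolerance instead.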


4.35 Signomial and geometric programs. A signomial is a linear combination of monomials of
some positive variables $x_1, \ldots, x_n$. Signomials are more general than posynomials, which
are signomials with all positive coefficients. A signomial program is an optimization
problem of the form

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p,
\end{array}
$$

where $f_0, \ldots, f_m$ and $h_1, \ldots, h_p$ are signomials. In general, signomial programs are very
difficult to solve.

Some signomial programs can be transformed to GPs, and therefore solved efficiently.
Show how to do this for a signomial program of the following form:

• The objective signomial $f_0$ is a posynomial, i.e., its terms have only positive coefficients.

• Each inequality constraint signomial $f_1, \ldots, f_m$ has exactly one term with a negative
coefficient: $f_i = p_i - q_i$ where $p_i$ is posynomial, and $q_i$ is monomial.

• Each equality constraint signomial $h_1, \ldots, h_p$ has exactly one term with a positive
coefficient and one term with a negative coefficient: $h_i = r_i - s_i$ where $r_i$ and $s_i$ are
monomials.

4.36  Explain  how  to  reformulate  a  general  GP  as  an  equivalent  GP  in  which  every  posynomial 
(in  the  objective  and  constraints)  has  at  most  two  monomial  terms.  Hint.  Express  each 
sum  (of  monomials)  as  a  sum  of  sums,  each  with  two  terms. 

4.37  Generalized  posynomials  and  geometric  programming.  Let  xi , . . . ,  x„  be  positive  variables, 
and  suppose  the  functions  /;  :  R’1  — >  R,  i  =  1, . . . ,  k,  are  posynomials  of  xi, . . . ,  xn-  If 
4>  :  Rfc  ->  R  is  a  polynomial  with  nonnegative  coefficients,  then  the  composition 

h(x)  =  (p{fi{x),...,fk{x))  (4.69) 

is  a  posynomial,  since  posynomials  are  closed  under  products,  sums,  and  multiplication 
by  nonnegative  scalars.  For  example,  suppose  fi  and  f2  are  posynomials,  and  consider 
the  polynomial  </>(«i,  22)  =  3z2Z2  +  2zi  +  3 z\  (which  has  nonnegative  coefficients).  Then 
h  =  3/1/2  +  2/i  +  f 2  is  a  posynomial. 

In this problem we consider a generalization of this idea, in which $\phi$ is allowed to be
a posynomial, i.e., can have fractional exponents. Specifically, assume that $\phi : \mathbf{R}^k \to \mathbf{R}$
is a posynomial, with all its exponents nonnegative. In this case we will call the
function $h$ defined in (4.69) a generalized posynomial. As an example, suppose $f_1$ and $f_2$
are posynomials, and consider the posynomial (with nonnegative exponents)
$\phi(z_1, z_2) = 2z_1^{0.3} z_2^{1.2} + z_1 z_2^{0.5} + 2$. Then the function

$$h(x) = 2f_1(x)^{0.3} f_2(x)^{1.2} + f_1(x) f_2(x)^{0.5} + 2$$

is a generalized posynomial. Note that it is not a posynomial, however (unless $f_1$ and $f_2$
are monomials or constants).
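As a small numerical illustration of the example above, the following sketch evaluates the generalized posynomial $h$ at one point. The two posynomials `f1` and `f2` are hypothetical choices made only for this illustration; they do not come from the text.

```python
# Hypothetical example posynomials in x = (x1, x2); the coefficients and
# exponents are illustrative assumptions, not taken from the exercise.
def f1(x1, x2):
    return 2.0 * x1 * x2**0.5 + x1**-1

def f2(x1, x2):
    return x1**2 + 3.0 * x2

def h(x1, x2):
    """Generalized posynomial 2*f1^0.3*f2^1.2 + f1*f2^0.5 + 2."""
    a, b = f1(x1, x2), f2(x1, x2)
    return 2.0 * a**0.3 * b**1.2 + a * b**0.5 + 2.0

# At x = (1, 1) we have f1 = 3 and f2 = 4.
value = h(1.0, 1.0)
```

Because the outer exponents are fractional, expanding `h` does not in general give a finite sum of monomials in `x1` and `x2`, which is why $h$ is called a generalized posynomial rather than a posynomial.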

A generalized geometric program (GGP) is an optimization problem of the form

$$\begin{array}{ll}
\text{minimize} & h_0(x) \\
\text{subject to} & h_i(x) \le 1, \quad i = 1, \ldots, m \\
& g_i(x) = 1, \quad i = 1, \ldots, p,
\end{array} \qquad (4.70)$$

where $g_1, \ldots, g_p$ are monomials, and $h_0, \ldots, h_m$ are generalized posynomials.

Show how to express this generalized geometric program as an equivalent geometric program.
Explain any new variables you introduce, and explain how your GP is equivalent
to the GGP (4.70).


Semidefinite  programming  and  conic  form  problems 

4.38 LMIs and SDPs with one variable. The generalized eigenvalues of a matrix pair $(A, B)$,
where $A, B \in \mathbf{S}^n$, are defined as the roots of the polynomial $\det(\lambda B - A)$ (see §A.5.3).
Suppose $B$ is nonsingular, and that $A$ and $B$ can be simultaneously diagonalized by a
congruence, i.e., there exists a nonsingular $R \in \mathbf{R}^{n \times n}$ such that

$$R^T A R = \mathbf{diag}(a), \qquad R^T B R = \mathbf{diag}(b),$$

where $a, b \in \mathbf{R}^n$. (A sufficient condition for this to hold is that there exist $t_1$, $t_2$ such
that $t_1 A + t_2 B \succ 0$.)

(a) Show that the generalized eigenvalues of $(A, B)$ are real, and given by $\lambda_i = a_i / b_i$,
$i = 1, \ldots, n$.

(b) Express the solution of the SDP

$$\begin{array}{ll}
\text{minimize} & ct \\
\text{subject to} & tB \preceq A,
\end{array}$$

with variable $t \in \mathbf{R}$, in terms of $a$ and $b$.
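The claim in part (a) can be checked numerically on a small instance before attempting a proof. The sketch below is only a sanity check under assumed data: it builds a $2 \times 2$ pair $(A, B)$ from hypothetical vectors `a`, `b` and a hypothetical nonsingular matrix `S` (playing the role of $R^{-1}$), then compares the roots of $\det(\lambda B - A)$ with the ratios $a_i/b_i$.

```python
import math

def congruence(S, d):
    """Return S^T diag(d) S for a 2x2 matrix S given as nested lists."""
    return [[sum(d[k] * S[k][i] * S[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def generalized_eigenvalues_2x2(A, B):
    """Roots of det(lambda*B - A) for symmetric 2x2 A, B (assumes real roots)."""
    c2 = B[0][0] * B[1][1] - B[0][1] ** 2
    c1 = -(B[0][0] * A[1][1] + A[0][0] * B[1][1] - 2 * A[0][1] * B[0][1])
    c0 = A[0][0] * A[1][1] - A[0][1] ** 2
    disc = math.sqrt(c1 * c1 - 4 * c2 * c0)
    return sorted([(-c1 - disc) / (2 * c2), (-c1 + disc) / (2 * c2)])

S = [[1.0, 2.0], [0.0, 1.0]]      # assumed S = R^{-1}, any nonsingular matrix
a, b = [3.0, -1.0], [2.0, 4.0]    # assumed diagonal entries, with b_i != 0
A, B = congruence(S, a), congruence(S, b)

eigs = generalized_eigenvalues_2x2(A, B)
ratios = sorted(ai / bi for ai, bi in zip(a, b))
```

With these data, `R = S^{-1}` satisfies $R^T A R = \mathbf{diag}(a)$ and $R^T B R = \mathbf{diag}(b)$, so the two lists should agree up to rounding.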

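4.39 SDPs and congruence transformations. Consider the SDP

$$\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & x_1 F_1 + x_2 F_2 + \cdots + x_n F_n + G \preceq 0,
\end{array}$$

with $F_i, G \in \mathbf{S}^k$, $c \in \mathbf{R}^n$.

(a) Suppose $R \in \mathbf{R}^{k \times k}$ is nonsingular. Show that the SDP is equivalent to the SDP

$$\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & x_1 \tilde F_1 + x_2 \tilde F_2 + \cdots + x_n \tilde F_n + \tilde G \preceq 0,
\end{array}$$

where $\tilde F_i = R^T F_i R$, $\tilde G = R^T G R$.

(b) Suppose there exists a nonsingular $R$ such that $\tilde F_i$ and $\tilde G$ are diagonal. Show that
the SDP is equivalent to an LP.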

(c) Suppose there exists a nonsingular $R$ such that $\tilde F_i$ and $\tilde G$ have the form

$$\tilde F_i = \begin{bmatrix} \alpha_i I & a_i \\ a_i^T & \alpha_i \end{bmatrix}, \quad i = 1, \ldots, n, \qquad
\tilde G = \begin{bmatrix} \beta I & b \\ b^T & \beta \end{bmatrix},$$

where $\alpha_i, \beta \in \mathbf{R}$, $a_i, b \in \mathbf{R}^{k-1}$. Show that the SDP is equivalent to an SOCP with
a single second-order cone constraint.

4.40 LPs, QPs, QCQPs, and SOCPs as SDPs. Express the following problems as SDPs.

(a) The LP (4.27).

(b) The QP (4.34), the QCQP (4.35) and the SOCP (4.36). Hint. Suppose $A \in \mathbf{S}^r_{++}$,
$C \in \mathbf{S}^s$, and $B \in \mathbf{R}^{r \times s}$. Then

$$\begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \succeq 0 \iff C - B^T A^{-1} B \succeq 0.$$

For a more complete statement, which applies also to singular $A$, and a proof,
see §A.5.5.

(c) The matrix fractional optimization problem

$$\text{minimize} \quad (Ax + b)^T F(x)^{-1} (Ax + b)$$

where $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$,

$$F(x) = F_0 + x_1 F_1 + \cdots + x_n F_n,$$

with $F_i \in \mathbf{S}^m$, and we take the domain of the objective to be $\{x \mid F(x) \succ 0\}$. You
can assume the problem is feasible (there exists at least one $x$ with $F(x) \succ 0$).
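The Schur complement hint in part (b) can be checked on the simplest case, where $A$, $B$, $C$ are scalars with $A > 0$. The sketch below is an illustrative check under that assumption, not a proof; the test cases are arbitrary values chosen away from the boundary so that rounding does not matter.

```python
import math

def min_eig_2x2(a, b, c):
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    return 0.5 * ((a + c) - math.sqrt((a - c) ** 2 + 4.0 * b * b))

def schur_test(a, b, c):
    """Schur complement condition c - b * a^{-1} * b >= 0, assuming a > 0."""
    return c - b * b / a >= 0.0

# Arbitrary (a, b, c) triples with a > 0; two are PSD and two are not.
cases = [(2.0, 1.0, 1.0), (2.0, 2.0, 1.0), (1.0, -3.0, 10.0), (4.0, 1.0, 0.1)]
agree = [(min_eig_2x2(a, b, c) >= 0.0) == schur_test(a, b, c)
         for a, b, c in cases]
```

In each case the direct eigenvalue test and the Schur complement test should give the same verdict.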

4.41 LMI tests for copositive matrices and $P_0$-matrices. A matrix $A \in \mathbf{S}^n$ is said to be copositive
if $x^T A x \ge 0$ for all $x \succeq 0$ (see exercise 2.35). A matrix $A \in \mathbf{R}^{n \times n}$ is said to be a
$P_0$-matrix if $\max_{i=1,\ldots,n} x_i (Ax)_i \ge 0$ for all $x$. Checking whether a matrix is copositive or
a $P_0$-matrix is very difficult in general. However, there exist useful sufficient conditions
that can be verified using semidefinite programming.

(a) Show that $A$ is copositive if it can be decomposed as a sum of a positive semidefinite
and an elementwise nonnegative matrix:

$$A = B + C, \qquad B \succeq 0, \qquad C_{ij} \ge 0, \quad i, j = 1, \ldots, n. \qquad (4.71)$$

Express the problem of finding $B$ and $C$ that satisfy (4.71) as an SDP feasibility
problem.

(b) Show that $A$ is a $P_0$-matrix if there exists a positive diagonal matrix $D$ such that

$$DA + A^T D \succeq 0. \qquad (4.72)$$

Express the problem of finding a $D$ that satisfies (4.72) as an SDP feasibility problem.
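To make the decomposition (4.71) concrete, the sketch below takes one hypothetical pair $B \succeq 0$, $C \ge 0$ and samples the quadratic form of $A = B + C$ on the nonnegative orthant. The matrices are assumptions chosen for illustration; sampling is of course only evidence, not a certificate.

```python
import random

def quad_form(A, x):
    """Return x^T A x for a square matrix A given as nested lists."""
    n = len(x)
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

B = [[1.0, -1.0], [-1.0, 1.0]]   # PSD, since x^T B x = (x1 - x2)^2
C = [[0.0, 3.0], [3.0, 0.0]]     # elementwise nonnegative
A = [[B[i][j] + C[i][j] for j in range(2)] for i in range(2)]

random.seed(0)
samples = [[random.random(), random.random()] for _ in range(1000)]
min_on_orthant = min(quad_form(A, x) for x in samples)   # x >= 0 only
value_outside = quad_form(A, [1.0, -1.0])                # x not in the orthant
```

Here `A` is not positive semidefinite (the form is negative at $x = (1, -1)$), yet it is nonnegative at every sampled $x \succeq 0$, which is consistent with copositivity being weaker than positive semidefiniteness.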

4.42 Complex LMIs and SDPs. A complex LMI has the form

$$x_1 F_1 + \cdots + x_n F_n + G \preceq 0$$

where $F_1, \ldots, F_n$, $G$ are complex $n \times n$ Hermitian matrices, i.e., $F_i^H = F_i$, $G^H = G$, and
$x \in \mathbf{R}^n$ is a real variable. A complex SDP is the problem of minimizing a (real) linear
function of $x$ subject to a complex LMI constraint.

Complex LMIs and SDPs can be transformed to real LMIs and SDPs, using the fact that

$$X \succeq 0 \iff \begin{bmatrix} \Re X & -\Im X \\ \Im X & \Re X \end{bmatrix} \succeq 0,$$

where $\Re X \in \mathbf{R}^{n \times n}$ is the real part of the complex Hermitian matrix $X$, and $\Im X \in \mathbf{R}^{n \times n}$
is the imaginary part of $X$.

Verify this result, and show how to pose a complex SDP as a real SDP.

4.43 Eigenvalue optimization via SDP. Suppose $A : \mathbf{R}^n \to \mathbf{S}^m$ is affine, i.e.,

$$A(x) = A_0 + x_1 A_1 + \cdots + x_n A_n$$

where $A_i \in \mathbf{S}^m$. Let $\lambda_1(x) \ge \lambda_2(x) \ge \cdots \ge \lambda_m(x)$ denote the eigenvalues of $A(x)$. Show
how to pose the following problems as SDPs.

(a) Minimize the maximum eigenvalue $\lambda_1(x)$.

(b) Minimize the spread of the eigenvalues, $\lambda_1(x) - \lambda_m(x)$.

(c) Minimize the condition number of $A(x)$, subject to $A(x) \succ 0$. The condition number
is defined as $\kappa(A(x)) = \lambda_1(x)/\lambda_m(x)$, with domain $\{x \mid A(x) \succ 0\}$. You may assume
that $A(x) \succ 0$ for at least one $x$.

Hint. You need to minimize $\lambda/\gamma$, subject to

$$0 \prec \gamma I \preceq A(x) \preceq \lambda I.$$

Change variables to $y = x/\gamma$, $t = \lambda/\gamma$, $s = 1/\gamma$.

(d) Minimize the sum of the absolute values of the eigenvalues, $|\lambda_1(x)| + \cdots + |\lambda_m(x)|$.
Hint. Express $A(x)$ as $A(x) = A_+ - A_-$, where $A_+ \succeq 0$, $A_- \succeq 0$.

4.44 Optimization over polynomials. Pose the following problem as an SDP. Find the polynomial
$p : \mathbf{R} \to \mathbf{R}$,

$$p(t) = x_1 + x_2 t + \cdots + x_{2k+1} t^{2k},$$

that satisfies given bounds $l_i \le p(t_i) \le u_i$, at $m$ specified points $t_i$, and, of all the
polynomials that satisfy these bounds, has the greatest minimum value:

$$\begin{array}{ll}
\text{maximize} & \inf_t p(t) \\
\text{subject to} & l_i \le p(t_i) \le u_i, \quad i = 1, \ldots, m.
\end{array}$$

The variables are $x \in \mathbf{R}^{2k+1}$.

Hint. Use the LMI characterization of nonnegative polynomials derived in exercise 2.37,
part (b).

4.45 [Nes00, Par00] Sum-of-squares representation via LMIs. Consider a polynomial $p : \mathbf{R}^n \to \mathbf{R}$
of degree $2k$. The polynomial is said to be positive semidefinite (PSD) if $p(x) \ge 0$
for all $x \in \mathbf{R}^n$. Except for special cases (e.g., $n = 1$ or $k = 1$), it is extremely difficult
to determine whether or not a given polynomial is PSD, let alone solve an optimization
problem, with the coefficients of $p$ as variables, with the constraint that $p$ be PSD.

A famous sufficient condition for a polynomial to be PSD is that it have the form

$$p(x) = \sum_{i=1}^r q_i(x)^2,$$

for some polynomials $q_i$, with degree no more than $k$. A polynomial $p$ that has this
sum-of-squares form is called SOS.

The  condition  that  a  polynomial  p  be  SOS  (viewed  as  a  constraint  on  its  coefficients) 
turns  out  to  be  equivalent  to  an  LMI,  and  therefore  a  variety  of  optimization  problems, 
with  SOS  constraints,  can  be  posed  as  SDPs.  You  will  explore  these  ideas  in  this  problem. 

(a) Let $f_1, \ldots, f_s$ be all monomials of degree $k$ or less. (Here we mean monomial in
the standard sense, i.e., $x_1^{m_1} \cdots x_n^{m_n}$, where $m_i \in \mathbf{Z}_+$, and not in the sense used in
geometric programming.) Show that if $p$ can be expressed as a positive semidefinite
quadratic form $p = f^T V f$, with $V \in \mathbf{S}^s_+$, then $p$ is SOS. Conversely, show that if
$p$ is SOS, then it can be expressed as a positive semidefinite quadratic form in the
monomials, i.e., $p = f^T V f$, for some $V \in \mathbf{S}^s_+$.

(b) Show that the condition $p = f^T V f$ is a set of linear equality constraints relating the
coefficients of $p$ and the matrix $V$. Combined with part (a) above, this shows that
the condition that $p$ be SOS is equivalent to a set of linear equalities relating $V$ and
the coefficients of $p$, and the matrix inequality $V \succeq 0$.

(c) Work out the LMI conditions for SOS explicitly for the case where $p$ is polynomial
of degree four in two variables.
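A one-variable instance may help fix the notation $p = f^T V f$ before working part (c). The sketch below is an assumed toy case ($n = 1$, $k = 2$, monomial vector $f = (1, x, x^2)$), not the two-variable case asked for: it checks that two different matrices $V$ represent the same polynomial $p(x) = x^4 + 2x^2 + 1$.

```python
def gram_value(V, x):
    """Evaluate f(x)^T V f(x) with monomial vector f(x) = (1, x, x^2)."""
    f = [1.0, x, x * x]
    return sum(V[i][j] * f[i] * f[j] for i in range(3) for j in range(3))

def p(x):
    return x**4 + 2.0 * x**2 + 1.0

# Hypothetical Gram matrix V = v v^T with v = (1, 0, 1): rank one, PSD,
# and f^T V f = (1 + x^2)^2.
v = [1.0, 0.0, 1.0]
V = [[v[i] * v[j] for j in range(3)] for i in range(3)]

# A second PSD choice: the x^2 coefficient is carried by V[1][1] instead of
# the (1, x^2) cross terms, giving 1^2 + (sqrt(2) x)^2 + (x^2)^2.
V_alt = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]

points = [-2.0, -0.5, 0.0, 1.0, 3.0]
max_gap = max(abs(gram_value(V, x) - p(x)) for x in points)
max_gap_alt = max(abs(gram_value(V_alt, x) - p(x)) for x in points)
```

Both matrices satisfy the same linear equalities on the coefficients of $p$, which illustrates that $V$ is generally not unique and that the SOS condition asks only for some $V \succeq 0$ among the solutions.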

4.46 Multidimensional moments. The moments of a random variable $t$ on $\mathbf{R}^2$ are defined as
$\mu_{ij} = \mathbf{E}\, t_1^i t_2^j$, where $i, j$ are nonnegative integers. In this problem we derive necessary
conditions for a set of numbers $\mu_{ij}$, $0 \le i, j \le 2k$, $i + j \le 2k$, to be the moments of a
distribution on $\mathbf{R}^2$.

Let $p : \mathbf{R}^2 \to \mathbf{R}$ be a polynomial of degree $k$ with coefficients $c_{ij}$,

$$p(t) = \sum_{i=0}^{k} \sum_{j=0}^{k-i} c_{ij} t_1^i t_2^j,$$

and let $t$ be a random variable with moments $\mu_{ij}$. Suppose $c \in \mathbf{R}^{(k+1)(k+2)/2}$ contains
the coefficients $c_{ij}$ in some specific order, and $\mu \in \mathbf{R}^{(k+1)(2k+1)}$ contains the moments $\mu_{ij}$
in the same order. Show that $\mathbf{E}\, p(t)^2$ can be expressed as a quadratic form in $c$:

$$\mathbf{E}\, p(t)^2 = c^T H(\mu) c,$$

where $H : \mathbf{R}^{(k+1)(2k+1)} \to \mathbf{S}^{(k+1)(k+2)/2}$ is a linear function of $\mu$. From this, conclude
that $\mu$ must satisfy the LMI $H(\mu) \succeq 0$.

Remark: For random variables on $\mathbf{R}$, the matrix $H$ can be taken as the Hankel matrix
defined in (4.52). In this case, $H(\mu) \succeq 0$ is a necessary and sufficient condition for $\mu$ to be
the moments of a distribution, or the limit of a sequence of moments. On $\mathbf{R}^2$, however,
the LMI is only a necessary condition.
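The one-dimensional case mentioned in the remark is easy to check numerically. The sketch below assumes the Hankel matrix has entries $H_{ij} = \mu_{i+j}$ (indices starting at zero) and uses a hypothetical three-point distribution and polynomial; it compares $c^T H(\mu) c$ with a direct evaluation of $\mathbf{E}\, p(t)^2$.

```python
def moments(values, probs, order):
    """Moments mu_0, ..., mu_order of a discrete random variable on R."""
    return [sum(p * t**i for t, p in zip(values, probs))
            for i in range(order + 1)]

def hankel(mu, k):
    """(k+1) x (k+1) Hankel matrix with entries H[i][j] = mu[i + j]."""
    return [[mu[i + j] for j in range(k + 1)] for i in range(k + 1)]

# Hypothetical discrete distribution and polynomial coefficients.
values, probs = [-1.0, 0.5, 2.0], [0.2, 0.5, 0.3]
k = 2
c = [1.0, -2.0, 0.5]                       # p(t) = 1 - 2 t + 0.5 t^2
H = hankel(moments(values, probs, 2 * k), k)

quad = sum(c[i] * H[i][j] * c[j] for i in range(k + 1) for j in range(k + 1))
direct = sum(p * sum(ci * t**i for i, ci in enumerate(c)) ** 2
             for t, p in zip(values, probs))
```

Since `direct` is an expectation of a square it is nonnegative for every `c`, which is the reason the moment matrix must be positive semidefinite.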

4.47 Maximum determinant positive semidefinite matrix completion. We consider a matrix
$A \in \mathbf{S}^n$, with some entries specified, and the others not specified. The positive semidefinite
matrix completion problem is to determine values of the unspecified entries of the matrix
so that $A \succeq 0$ (or to determine that such a completion does not exist).

(a)  Explain  why  we  can  assume  without  loss  of  generality  that  the  diagonal  entries  of 
A  are  specified. 

(b)  Show  how  to  formulate  the  positive  semidefinite  completion  problem  as  an  SDP 
feasibility  problem. 

(c) Assume that $A$ has at least one completion that is positive definite, and the diagonal
entries of $A$ are specified (i.e., fixed). The positive definite completion with
largest determinant is called the maximum determinant completion. Show that the
maximum determinant completion is unique. Show that if $A^\star$ is the maximum
determinant completion, then $(A^\star)^{-1}$ has zeros in all the entries of the original matrix
that were not specified. Hint. The gradient of the function $f(X) = \log\det X$ is
$\nabla f(X) = X^{-1}$ (see §A.4.1).

(d) Suppose $A$ is specified on its tridiagonal part, i.e., we are given $A_{11}, \ldots, A_{nn}$ and
$A_{12}, \ldots, A_{n-1,n}$. Show that if there exists a positive definite completion of $A$, then
there is a positive definite completion whose inverse is tridiagonal.
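The statement in parts (c) and (d) can be observed on the smallest nontrivial case, a $3 \times 3$ matrix with only the $(1,3)$ entry unspecified. The sketch below is a numerical illustration with assumed data (unit diagonal, off-diagonal values `a` and `b`); it finds the maximum determinant completion by a crude grid search rather than by convex optimization.

```python
def det3(M):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def completion(x, a=0.5, b=-0.4):
    """Tridiagonal part fixed (unit diagonal, a, b); x fills the (1,3) entry."""
    return [[1.0, a, x], [a, 1.0, b], [x, b, 1.0]]

# The leading 2x2 block is positive definite, so a completion is positive
# definite exactly when its determinant is positive.
grid = [i / 1000.0 for i in range(-900, 901)]
x_best = max(grid, key=lambda x: det3(completion(x)))

M = completion(x_best)
# The (1,3) entry of the inverse is cofactor / det, and the relevant
# cofactor is M12 * M23 - M13 * M22.
inv_13 = (M[0][1] * M[1][2] - M[0][2] * M[1][1]) / det3(M)
```

For these data the search lands on `x_best = a * b`, and the corresponding entry of the inverse vanishes, so the inverse is tridiagonal.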

4.48 Generalized eigenvalue minimization. Recall (from example 3.37, or §A.5.3) that the
largest generalized eigenvalue of a pair of matrices $(A, B) \in \mathbf{S}^k \times \mathbf{S}^k_{++}$ is given by

$$\lambda_{\max}(A, B) = \sup_{u \ne 0} \frac{u^T A u}{u^T B u} = \max\{\lambda \mid \det(\lambda B - A) = 0\}.$$

As we have seen, this function is quasiconvex (if we take $\mathbf{S}^k \times \mathbf{S}^k_{++}$ as its domain).

We consider the problem

$$\text{minimize} \quad \lambda_{\max}(A(x), B(x)) \qquad (4.73)$$

where $A, B : \mathbf{R}^n \to \mathbf{S}^k$ are affine functions, defined as

$$A(x) = A_0 + x_1 A_1 + \cdots + x_n A_n, \qquad B(x) = B_0 + x_1 B_1 + \cdots + x_n B_n,$$

with $A_i, B_i \in \mathbf{S}^k$.

(a) Give a family of convex functions $\phi_t : \mathbf{S}^k \times \mathbf{S}^k \to \mathbf{R}$, that satisfy

$$\lambda_{\max}(A, B) \le t \iff \phi_t(A, B) \le 0$$

for all $(A, B) \in \mathbf{S}^k \times \mathbf{S}^k_{++}$. Show that this allows us to solve (4.73) by solving a
sequence of convex feasibility problems.

(b) Give a family of matrix-convex functions $\Phi_t : \mathbf{S}^k \times \mathbf{S}^k \to \mathbf{S}^k$ that satisfy

$$\lambda_{\max}(A, B) \le t \iff \Phi_t(A, B) \preceq 0$$

for all $(A, B) \in \mathbf{S}^k \times \mathbf{S}^k_{++}$. Show that this allows us to solve (4.73) by solving a
sequence of convex feasibility problems with LMI constraints.

(c) Suppose $B(x) = (a^T x + b) I$, with $a \ne 0$. Show that (4.73) is equivalent to the convex
problem

$$\begin{array}{ll}
\text{minimize} & \lambda_{\max}(s A_0 + y_1 A_1 + \cdots + y_n A_n) \\
\text{subject to} & a^T y + b s = 1 \\
& s \ge 0,
\end{array}$$

with variables $y \in \mathbf{R}^n$, $s \in \mathbf{R}$.

4.49 Generalized fractional programming. Let $K \subseteq \mathbf{R}^m$ be a proper cone. Show that the
function $f_0 : \mathbf{R}^n \to \mathbf{R}$, defined by

$$f_0(x) = \inf\{t \mid Cx + d \preceq_K t(Fx + g)\}, \qquad \mathbf{dom}\, f_0 = \{x \mid Fx + g \succ_K 0\},$$

with $C, F \in \mathbf{R}^{m \times n}$, $d, g \in \mathbf{R}^m$, is quasiconvex.

A quasiconvex optimization problem with objective function of this form is called a
generalized fractional program. Express the generalized linear-fractional program of page 152
and the generalized eigenvalue minimization problem (4.73) as generalized fractional
programs.

Vector  and  multicriterion  optimization 

4.50 Bi-criterion optimization. Figure 4.11 shows the optimal trade-off curve and the set of
achievable values for the bi-criterion optimization problem

$$\text{minimize (w.r.t. } \mathbf{R}^2_+) \quad (\|Ax - b\|_2^2, \ \|x\|_2^2),$$

for some $A \in \mathbf{R}^{100 \times 10}$, $b \in \mathbf{R}^{100}$. Answer the following questions using information from
the plot. We denote by $x_{\mathrm{ls}}$ the solution of the least-squares problem

$$\text{minimize} \quad \|Ax - b\|_2^2.$$

(a) What is $\|x_{\mathrm{ls}}\|_2$?

(b) What is $\|Ax_{\mathrm{ls}} - b\|_2$?

(c) What is $\|b\|_2$?

(d) Give the optimal value of the problem

$$\begin{array}{ll}
\text{minimize} & \|Ax - b\|_2^2 \\
\text{subject to} & \|x\|_2^2 = 1.
\end{array}$$

(e) Give the optimal value of the problem

$$\begin{array}{ll}
\text{minimize} & \|Ax - b\|_2^2 \\
\text{subject to} & \|x\|_2^2 \le 1.
\end{array}$$

(f) Give the optimal value of the problem

$$\text{minimize} \quad \|Ax - b\|_2^2 + \|x\|_2^2.$$

(g) What is the rank of $A$?

4.51 Monotone transformation of objective in vector optimization. Consider the vector
optimization problem (4.56). Suppose we form a new vector optimization problem by replacing
the objective $f_0$ with $\phi \circ f_0$, where $\phi : \mathbf{R}^q \to \mathbf{R}^q$ satisfies

$$u \preceq_K v, \ u \ne v \implies \phi(u) \preceq_K \phi(v), \ \phi(u) \ne \phi(v).$$

Show  that  a  point  x  is  Pareto  optimal  (or  optimal)  for  one  problem  if  and  only  if  it  is 
Pareto  optimal  (optimal)  for  the  other,  so  the  two  problems  are  equivalent.  In  particular, 
composing  each  objective  in  a  multicriterion  problem  with  an  increasing  function  does 
not  affect  the  Pareto  optimal  points. 

4.52 Pareto optimal points and the boundary of the set of achievable values. Consider a vector
optimization problem with cone $K$. Let $\mathcal{P}$ denote the set of Pareto optimal values, and
let $\mathcal{O}$ denote the set of achievable objective values. Show that $\mathcal{P} \subseteq \mathcal{O} \cap \mathbf{bd}\, \mathcal{O}$, i.e., every
Pareto optimal value is an achievable objective value that lies in the boundary of the set
of achievable objective values.

4.53 Suppose the vector optimization problem (4.56) is convex. Show that the set

$$\mathcal{A} = \mathcal{O} + K = \{t \in \mathbf{R}^q \mid f_0(x) \preceq_K t \text{ for some feasible } x\},$$

is convex. Also show that the minimal elements of $\mathcal{A}$ are the same as the minimal points
of $\mathcal{O}$.

4.54 Scalarization and optimal points. Suppose a (not necessarily convex) vector optimization
problem has an optimal point $x^\star$. Show that $x^\star$ is a solution of the associated scalarized
problem for any choice of $\lambda \succ_{K^*} 0$. Also show the converse: If a point $x$ is a solution of
the scalarized problem for any choice of $\lambda \succ_{K^*} 0$, then it is an optimal point for the (not
necessarily convex) vector optimization problem.

4.55 Generalization of weighted-sum scalarization. In §4.7.4 we showed how to obtain Pareto
optimal solutions of a vector optimization problem by replacing the vector objective
$f_0 : \mathbf{R}^n \to \mathbf{R}^q$ with the scalar objective $\lambda^T f_0$, where $\lambda \succ_{K^*} 0$. Let $\psi : \mathbf{R}^q \to \mathbf{R}$ be a
$K$-increasing function, i.e., satisfying

$$u \preceq_K v, \ u \ne v \implies \psi(u) < \psi(v).$$

Show that any solution of the problem

$$\begin{array}{ll}
\text{minimize} & \psi(f_0(x)) \\
\text{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p
\end{array}$$

is Pareto optimal for the vector optimization problem

$$\begin{array}{ll}
\text{minimize (w.r.t. } K) & f_0(x) \\
\text{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p.
\end{array}$$

Note that $\psi(u) = \lambda^T u$, where $\lambda \succ_{K^*} 0$, is a special case.

As a related example, show that in a multicriterion optimization problem (i.e., a vector
optimization problem with $f_0 = F : \mathbf{R}^n \to \mathbf{R}^q$, and $K = \mathbf{R}^q_+$), a unique solution of the
scalar optimization problem

$$\begin{array}{ll}
\text{minimize} & \max_{i=1,\ldots,q} F_i(x) \\
\text{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p,
\end{array}$$

is Pareto optimal.
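The two scalarizations in this exercise can be tried on a finite problem, where Pareto optimality can be checked by brute force. The sketch below uses a hypothetical list of achievable objective values in $\mathbf{R}^2$ with $K = \mathbf{R}^2_+$ and an assumed weight vector `lam`; it is an illustration of the statement, not a proof.

```python
def dominates(u, v):
    """True if u <= v componentwise and u != v (u is better w.r.t. R^q_+)."""
    return all(a <= b for a, b in zip(u, v)) and u != v

def pareto_optimal(values):
    """Values that are not dominated by any other achievable value."""
    return [v for v in values if not any(dominates(u, v) for u in values)]

# Hypothetical achievable objective values F(x) over a finite feasible set.
achievable = [(1.0, 5.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0), (5.0, 5.0)]
lam = (0.7, 0.3)   # assumed weights, strictly positive

weighted = min(achievable, key=lambda v: sum(l * vi for l, vi in zip(lam, v)))
minimax = min(achievable, key=max)
front = pareto_optimal(achievable)
```

Both the weighted-sum minimizer and the (here unique) minimax minimizer should appear in `front`. Note that the minimax objective is not strictly $K$-increasing, which is why the exercise asks for a unique solution in that case.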

Miscellaneous  problems 

4.56 [P. Parrilo] We consider the problem of minimizing the convex function $f_0 : \mathbf{R}^n \to \mathbf{R}$
over the convex hull of the union of some convex sets, $\mathbf{conv}\left(\bigcup_{i=1}^q C_i\right)$. These sets are
described via convex inequalities,

$$C_i = \{x \mid f_{ij}(x) \le 0, \ j = 1, \ldots, k_i\},$$

where $f_{ij} : \mathbf{R}^n \to \mathbf{R}$ are convex. Our goal is to formulate this problem as a convex
optimization problem.

The obvious approach is to introduce variables $x_1, \ldots, x_q \in \mathbf{R}^n$, with $x_i \in C_i$, $\theta \in \mathbf{R}^q$
with $\theta \succeq 0$, $\mathbf{1}^T \theta = 1$, and a variable $x \in \mathbf{R}^n$, with $x = \theta_1 x_1 + \cdots + \theta_q x_q$. This equality
constraint is not affine in the variables, so this approach does not yield a convex problem.
A more sophisticated formulation is given by

$$\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & s_i f_{ij}(z_i / s_i) \le 0, \quad i = 1, \ldots, q, \quad j = 1, \ldots, k_i \\
& \mathbf{1}^T s = 1, \quad s \succeq 0 \\
& x = z_1 + \cdots + z_q,
\end{array}$$

with variables $z_1, \ldots, z_q \in \mathbf{R}^n$, $x \in \mathbf{R}^n$, and $s_1, \ldots, s_q \in \mathbf{R}$. (When $s_i = 0$, we take
$s_i f_{ij}(z_i / s_i)$ to be $0$ if $z_i = 0$ and $\infty$ if $z_i \ne 0$.) Explain why this problem is convex, and
equivalent to the original problem.

4.57 Capacity of a communication channel. We consider a communication channel, with input
$X(t) \in \{1, \ldots, n\}$, and output $Y(t) \in \{1, \ldots, m\}$, for $t = 1, 2, \ldots$ (in seconds, say). The
relation between the input and the output is given statistically:

$$p_{ij} = \mathbf{prob}(Y(t) = i \mid X(t) = j), \quad i = 1, \ldots, m, \quad j = 1, \ldots, n.$$

The matrix $P \in \mathbf{R}^{m \times n}$ is called the channel transition matrix, and the channel is called
a discrete memoryless channel.

A famous result of Shannon states that information can be sent over the communication
channel, with arbitrarily small probability of error, at any rate less than a number $C$,
called the channel capacity, in bits per second. Shannon also showed that the capacity of
a discrete memoryless channel can be found by solving an optimization problem. Assume
that $X$ has a probability distribution denoted $x \in \mathbf{R}^n$, i.e.,

$$x_j = \mathbf{prob}(X = j), \quad j = 1, \ldots, n.$$

The mutual information between $X$ and $Y$ is given by

$$I(X; Y) = \sum_{i=1}^m \sum_{j=1}^n x_j p_{ij} \log_2 \frac{p_{ij}}{\sum_{k=1}^n x_k p_{ik}}.$$

Then the channel capacity $C$ is given by

$$C = \sup_x I(X; Y),$$

where the supremum is over all possible probability distributions for the input $X$, i.e.,
over $x \succeq 0$, $\mathbf{1}^T x = 1$.

Show how the channel capacity can be computed using convex optimization.

Hint. Introduce the variable $y = Px$, which gives the probability distribution of the
output $Y$, and show that the mutual information can be expressed as

$$I(X; Y) = c^T x - \sum_{i=1}^m y_i \log_2 y_i,$$

where $c_j = \sum_{i=1}^m p_{ij} \log_2 p_{ij}$, $j = 1, \ldots, n$.
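Before proving the identity in the hint, it can be checked numerically. The sketch below evaluates the mutual information both ways for a hypothetical binary symmetric channel with crossover probability 0.1 and a uniform input distribution; terms with $p_{ij} = 0$ are skipped, using the usual convention $0 \log 0 = 0$.

```python
import math

def mutual_information(P, x):
    """Direct formula: sum_i sum_j x_j p_ij log2(p_ij / sum_k x_k p_ik)."""
    m, n = len(P), len(x)
    y = [sum(P[i][k] * x[k] for k in range(n)) for i in range(m)]
    return sum(x[j] * P[i][j] * math.log2(P[i][j] / y[i])
               for i in range(m) for j in range(n) if P[i][j] > 0)

def mutual_information_hint(P, x):
    """Form from the hint: c^T x - sum_i y_i log2 y_i with y = P x."""
    m, n = len(P), len(x)
    c = [sum(P[i][j] * math.log2(P[i][j]) for i in range(m) if P[i][j] > 0)
         for j in range(n)]
    y = [sum(P[i][k] * x[k] for k in range(n)) for i in range(m)]
    return (sum(cj * xj for cj, xj in zip(c, x))
            - sum(yi * math.log2(yi) for yi in y if yi > 0))

# Hypothetical binary symmetric channel; columns of P sum to one.
P = [[0.9, 0.1], [0.1, 0.9]]
x = [0.5, 0.5]
I_direct = mutual_information(P, x)
I_hint = mutual_information_hint(P, x)
```

The second form is linear in `x` plus the entropy of `y = P x`, which is the structure the hint is pointing at.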

4.58 Optimal consumption. In this problem we consider the optimal way to consume (or spend)
an initial amount of money (or other asset) $k_0$ over time. The variables are $c_0, \ldots, c_T$,
where $c_t \ge 0$ denotes the consumption in period $t$. The utility derived from a consumption
level $c$ is given by $u(c)$, where $u : \mathbf{R} \to \mathbf{R}$ is an increasing concave function. The present
value of the utility derived from the consumption is given by

$$U = \sum_{t=0}^{T} \beta^t u(c_t),$$

where $0 < \beta < 1$ is a discount factor.

Let $k_t$ denote the amount of money available for investment in period $t$. We assume
that it earns an investment return given by $f(k_t)$, where $f : \mathbf{R} \to \mathbf{R}$ is an increasing,
concave investment return function, which satisfies $f(0) = 0$. For example if the funds
earn simple interest at rate $R$ percent per period, we have $f(a) = (R/100)a$. The amount
to be consumed, i.e., $c_t$, is withdrawn at the end of the period, so we have the recursion

$$k_{t+1} = k_t + f(k_t) - c_t, \quad t = 0, \ldots, T.$$

The initial sum $k_0 > 0$ is given. We require $k_t \ge 0$, $t = 1, \ldots, T+1$ (but more sophisticated
models, which allow $k_t < 0$, can be considered).

Show how to formulate the problem of maximizing $U$ as a convex optimization problem.
Explain how the problem you formulate is equivalent to this one, and exactly how the
two are related.

Hint. Show that we can replace the recursion for $k_t$ given above with the inequalities

$$k_{t+1} \le k_t + f(k_t) - c_t, \quad t = 0, \ldots, T.$$

(Interpretation:  the  inequalities  give  you  the  option  of  throwing  money  away  in  each 
period.)  For  a  more  general  version  of  this  trick,  see  exercise  4.6. 
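To get a feel for the model, the recursion and the discounted utility can be simulated for a fixed consumption plan. The sketch below uses assumed data only (5% simple interest, square-root utility, $T = 2$, constant consumption); it evaluates one feasible plan and does not solve the optimization problem.

```python
import math

def simulate(k0, consumption, f, u, beta):
    """Run k_{t+1} = k_t + f(k_t) - c_t and return (capital path, utility U)."""
    k = [k0]
    for c in consumption:
        k.append(k[-1] + f(k[-1]) - c)
    U = sum(beta**t * u(c) for t, c in enumerate(consumption))
    return k, U

# Hypothetical data: f(a) = 0.05 a, u = sqrt, beta = 0.9, c_0 = c_1 = c_2 = 30.
f = lambda a: 0.05 * a
u = math.sqrt
k_path, U = simulate(100.0, [30.0, 30.0, 30.0], f, u, beta=0.9)

# The plan is feasible if k_1, ..., k_{T+1} are all nonnegative.
feasible = all(kt >= 0 for kt in k_path[1:])
```

The capital path depends on the plan through the concave function `f`, so the equality recursion is in general not a convex constraint; this is the difficulty the hint addresses.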

4.59 Robust optimization. In some optimization problems there is uncertainty or variation
in the objective and constraint functions, due to parameters or factors that are either
beyond our control or unknown. We can model this situation by making the objective
and constraint functions $f_0, \ldots, f_m$ functions of the optimization variable $x \in \mathbf{R}^n$ and
a parameter vector $u \in \mathbf{R}^k$ that is unknown, or varies. In the stochastic optimization
approach, the parameter vector $u$ is modeled as a random variable with a known
distribution, and we work with the expected values $\mathbf{E}_u f_i(x, u)$. In the worst-case analysis
approach, we are given a set $U$ that $u$ is known to lie in, and we work with the maximum
or worst-case values $\sup_{u \in U} f_i(x, u)$. To simplify the discussion, we assume there are no
equality constraints.

(a) Stochastic optimization. We consider the problem

$$\begin{array}{ll}
\text{minimize} & \mathbf{E}\, f_0(x, u) \\
\text{subject to} & \mathbf{E}\, f_i(x, u) \le 0, \quad i = 1, \ldots, m,
\end{array}$$

where the expectation is with respect to $u$. Show that if $f_i$ are convex in $x$ for each
$u$, then this stochastic optimization problem is convex.

(b) Worst-case optimization. We consider the problem

$$\begin{array}{ll}
\text{minimize} & \sup_{u \in U} f_0(x, u) \\
\text{subject to} & \sup_{u \in U} f_i(x, u) \le 0, \quad i = 1, \ldots, m.
\end{array}$$

Show that if $f_i$ are convex in $x$ for each $u$, then this worst-case optimization problem
is convex.

(c) Finite set of possible parameter values. The observations made in parts (a) and (b)
are most useful when we have analytical or easily evaluated expressions for the
expected values $\mathbf{E}\, f_i(x, u)$ or the worst-case values $\sup_{u \in U} f_i(x, u)$.

Suppose the set of possible values of the parameter is finite, i.e., we
have $u \in \{u_1, \ldots, u_N\}$. For the stochastic case, we are also given the probabilities
of each value: $\mathbf{prob}(u = u_i) = p_i$, where $p \in \mathbf{R}^N$, $p \succeq 0$, $\mathbf{1}^T p = 1$. In the worst-case
formulation, we simply take $U = \{u_1, \ldots, u_N\}$.

Show how to set up the worst-case and stochastic optimization problems explicitly
(i.e., give explicit expressions for $\sup_{u \in U} f_i$ and $\mathbf{E}_u f_i$).
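For a finite parameter set, both quantities in part (c) are straightforward to evaluate. The sketch below shows one way to do this for a hypothetical function $f(x, u) = (x - u)^2$ with assumed scenarios and probabilities; the names `expected_value` and `worst_case_value` are illustrative, not from the text.

```python
def expected_value(f, x, scenarios, probs):
    """E_u f(x, u) when u takes finitely many values with given probabilities."""
    return sum(p * f(x, u) for u, p in zip(scenarios, probs))

def worst_case_value(f, x, scenarios):
    """sup over a finite set U of f(x, u); the supremum is a maximum."""
    return max(f(x, u) for u in scenarios)

# Hypothetical f, convex in x for each fixed u.
f = lambda x, u: (x - u) ** 2
scenarios, probs = [-1.0, 0.0, 2.0], [0.25, 0.5, 0.25]

ev = expected_value(f, 1.0, scenarios, probs)
wc = worst_case_value(f, 1.0, scenarios)
```

The expected value is a nonnegative weighted sum of the functions $f(\cdot, u_i)$ and the worst case is their pointwise maximum, so the worst case is never smaller than the expected value.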

4.60 Log-optimal investment strategy. We consider a portfolio problem with $n$ assets held over
$N$ periods. At the beginning of each period, we re-invest our total wealth, redistributing
it over the $n$ assets using a fixed, constant, allocation strategy $x \in \mathbf{R}^n$, where $x \succeq 0$,
$\mathbf{1}^T x = 1$. In other words, if $W(t-1)$ is our wealth at the beginning of period $t$, then
during period $t$ we invest $x_i W(t-1)$ in asset $i$. We denote by $\lambda(t)$ the total return during
period $t$, i.e., $\lambda(t) = W(t)/W(t-1)$. At the end of the $N$ periods our wealth has been
multiplied by the factor $\prod_{t=1}^N \lambda(t)$. We call

$$\frac{1}{N} \sum_{t=1}^{N} \log \lambda(t)$$

the growth rate of the investment over the $N$ periods. We are interested in determining
an allocation strategy $x$ that maximizes growth of our total wealth for large $N$.

We use a discrete stochastic model to account for the uncertainty in the returns. We
assume that during each period there are $m$ possible scenarios, with probabilities $\pi_j$,
$j = 1, \ldots, m$. In scenario $j$, the return for asset $i$ over one period is given by $p_{ij}$.
Therefore, the return $\lambda(t)$ of our portfolio during period $t$ is a random variable, with
$m$ possible values $p_1^T x, \ldots, p_m^T x$, and distribution

$$\pi_j = \mathbf{prob}(\lambda(t) = p_j^T x), \quad j = 1, \ldots, m.$$

We assume the same scenarios for each period, with (identical) independent distributions.
Using the law of large numbers, we have

$$\lim_{N \to \infty} \frac{1}{N} \log\left(\frac{W(N)}{W(0)}\right) = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \log \lambda(t) = \mathbf{E} \log \lambda(t) = \sum_{j=1}^{m} \pi_j \log(p_j^T x).$$

In other words, with investment strategy $x$, the long term growth rate is given by

$$R_{\mathrm{lt}} = \sum_{j=1}^{m} \pi_j \log(p_j^T x).$$

The investment strategy $x$ that maximizes this quantity is called the log-optimal
investment strategy, and can be found by solving the optimization problem

$$\begin{array}{ll}
\text{maximize} & \sum_{j=1}^{m} \pi_j \log(p_j^T x) \\
\text{subject to} & x \succeq 0, \quad \mathbf{1}^T x = 1,
\end{array}$$

with variable $x \in \mathbf{R}^n$.

Show  that  this  is  a  convex  optimization  problem. 
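The long term growth rate is simple to evaluate for a given allocation. The sketch below uses a hypothetical two-asset, two-scenario model (a risky asset that doubles or halves with equal probability, and cash with return 1) and compares two allocations; the data are assumptions made for illustration.

```python
import math

def growth_rate(x, returns, probs):
    """Long term growth rate sum_j pi_j * log(p_j^T x)."""
    return sum(pj * math.log(sum(r * xi for r, xi in zip(row, x)))
               for row, pj in zip(returns, probs))

# Hypothetical data: row j holds the scenario-j returns (p_1j, ..., p_nj).
returns = [[2.0, 1.0], [0.5, 1.0]]
probs = [0.5, 0.5]

all_risky = growth_rate([1.0, 0.0], returns, probs)
half_half = growth_rate([0.5, 0.5], returns, probs)
```

Putting everything in the risky asset gives zero growth rate here even though its expected return exceeds 1, while the mixed allocation grows; this is the kind of trade-off the log-optimal strategy captures.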

4.61 Optimization with logistic model. A random variable $X \in \{0, 1\}$ satisfies

$$\mathbf{prob}(X = 1) = p = \frac{\exp(a^T x + b)}{1 + \exp(a^T x + b)},$$

where $x \in \mathbf{R}^n$ is a vector of variables that affect the probability, and $a$ and $b$ are known
parameters. We can think of $X = 1$ as the event that a consumer buys a product, and
$x$ as a vector of variables that affect the probability, e.g., advertising effort, retail price,
discounted price, packaging expense, and other factors. The variable $x$, which we are to
optimize over, is subject to a set of linear constraints, $Fx \preceq g$.

Formulate  the  following  problems  as  convex  optimization  problems. 

(a)  Maximizing  buying  probability.  The  goal  is  to  choose  x  to  maximize  p. 

(b) Maximizing expected profit. Let $c^T x + d$ be the profit derived from selling the product,
which we assume is positive for all feasible $x$. The goal is to maximize the expected
profit, which is $p(c^T x + d)$.

4.62 Optimal power and bandwidth allocation in a Gaussian broadcast channel. We consider a
communication system in which a central node transmits messages to $n$ receivers. (‘Gaussian’
refers to the type of noise that corrupts the transmissions.) Each receiver channel
is characterized by its (transmit) power level $P_i \ge 0$ and its bandwidth $W_i \ge 0$. The
power and bandwidth of a receiver channel determine its bit rate $R_i$ (the rate at which
information can be sent) via

$$R_i = \alpha_i W_i \log(1 + \beta_i P_i / W_i),$$

where $\alpha_i$ and $\beta_i$ are known positive constants. For $W_i = 0$, we take $R_i = 0$ (which is
what you get if you take the limit as $W_i \to 0$).

The powers must satisfy a total power constraint, which has the form

$$P_1 + \cdots + P_n = P_{\mathrm{tot}},$$

where $P_{\mathrm{tot}} > 0$ is a given total power available to allocate among the channels. Similarly,
the bandwidths must satisfy

$$W_1 + \cdots + W_n = W_{\mathrm{tot}},$$

where $W_{\mathrm{tot}} > 0$ is the (given) total available bandwidth. The optimization variables in
this problem are the powers and bandwidths, i.e., $P_1, \ldots, P_n$, $W_1, \ldots, W_n$.

The objective is to maximize the total utility,

$$\sum_{i=1}^{n} u_i(R_i),$$

where $u_i : \mathbf{R} \to \mathbf{R}$ is the utility function associated with the $i$th receiver. (You can
think of $u_i(R_i)$ as the revenue obtained for providing a bit rate $R_i$ to receiver $i$, so the
objective is to maximize the total revenue.) You can assume that the utility functions $u_i$
are nondecreasing and concave.

Pose  this  problem  as  a  convex  optimization  problem. 
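The bit rate formula, including the convention $R_i = 0$ at $W_i = 0$, can be coded directly. The sketch below uses assumed constants $\alpha_i = \beta_i = 1$ and a few hypothetical operating points; it checks the behaviour near $W = 0$ and does a single midpoint comparison, which is a spot check rather than a proof of any property.

```python
import math

def bit_rate(P, W, alpha=1.0, beta=1.0):
    """R = alpha * W * log(1 + beta * P / W), with R = 0 when W = 0."""
    if W == 0.0:
        return 0.0
    return alpha * W * math.log(1.0 + beta * P / W)

# Behaviour as the bandwidth shrinks with the power held fixed.
small_W = bit_rate(2.0, 1e-9)
at_zero = bit_rate(2.0, 0.0)

# Midpoint comparison between (P, W) = (1, 2) and (3, 1).
mid = bit_rate(2.0, 1.5)
avg_ends = 0.5 * (bit_rate(1.0, 2.0) + bit_rate(3.0, 1.0))
```

For these points the rate at the midpoint is at least the average of the endpoint rates, which is what one would expect if $R_i$ is jointly concave in $(P_i, W_i)$.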

4.63 Optimally balancing manufacturing cost and yield. The vector $x \in \mathbf{R}^n$ denotes the nominal
parameters in a manufacturing process. The yield of the process, i.e., the fraction of
manufactured goods that is acceptable, is given by $Y(x)$. We assume that $Y$ is log-concave
(which is often the case; see example 3.43). The cost per unit to manufacture the product
is given by $c^T x$, where $c \in \mathbf{R}^n$. The cost per acceptable unit is $c^T x / Y(x)$. We want to
minimize $c^T x / Y(x)$, subject to some convex constraints on $x$ such as linear inequalities
$Ax \preceq b$. (You can assume that over the feasible set we have $c^T x > 0$ and $Y(x) > 0$.)

This problem is not a convex or quasiconvex optimization problem, but it can be solved
using convex optimization and a one-dimensional search. The basic ideas are given below;
you must supply all details and justification.

(a) Show that the function $f : \mathbf{R} \to \mathbf{R}$ given by

$$f(a) = \sup\{Y(x) \mid Ax \preceq b, \ c^T x = a\},$$

which gives the maximum yield versus cost, is log-concave. This means that by
solving a convex optimization problem (in $x$) we can evaluate the function $f$.

(b) Suppose that we evaluate the function $f$ for enough values of $a$ to give a good
approximation over the range of interest. Explain how to use these data to (approximately)
solve the problem of minimizing cost per good product.

4.64 Optimization with recourse. In an optimization problem with recourse, also called
two-stage optimization, the cost function and constraints depend not only on our choice of
variables, but also on a discrete random variable $s \in \{1, \ldots, S\}$, which is interpreted as
specifying which of $S$ scenarios occurred. The scenario random variable $s$ has known
probability distribution $\pi$, with $\pi_i = \mathbf{prob}(s = i)$, $i = 1, \ldots, S$.

In two-stage optimization, we are to choose the values of two variables, $x \in \mathbf{R}^n$ and
$z \in \mathbf{R}^q$. The variable $x$ must be chosen before the particular scenario $s$ is known; the
variable $z$, however, is chosen after the value of the scenario random variable is known.
In other words, $z$ is a function of the scenario random variable $s$. To describe our choice
$z$, we list the values we would choose under the different scenarios, i.e., we list the vectors

\[ z_1, \ldots, z_S \in \mathbf{R}^q. \]

Here $z_3$ is our choice of $z$ when $s = 3$ occurs, and so on. The set of values

\[ x \in \mathbf{R}^n, \quad z_1, \ldots, z_S \in \mathbf{R}^q \]

is called the policy, since it tells us what choice to make for $x$ (independent of which
scenario occurs), and also, what choice to make for $z$ in each possible scenario.

The variable $z$ is called the recourse variable (or second-stage variable), since it allows
us to take some action or make a choice after we know which scenario occurred. In
contrast, our choice of $x$ (which is called the first-stage variable) must be made without
any knowledge of the scenario.

For simplicity we will consider the case with no constraints. The cost function is given by

\[ f : \mathbf{R}^n \times \mathbf{R}^q \times \{1, \ldots, S\} \to \mathbf{R}, \]

where $f(x, z, i)$ gives the cost when the first-stage choice $x$ is made, second-stage choice
$z$ is made, and scenario $i$ occurs. We will take as the overall objective, to be minimized
over all policies, the expected cost

\[ \mathbf{E}\, f(x, z_s, s) = \sum_{i=1}^{S} \pi_i f(x, z_i, i). \]

Suppose that $f$ is a convex function of $(x, z)$, for each scenario $i = 1, \ldots, S$. Explain
how to find an optimal policy, i.e., one that minimizes the expected cost over all possible
policies, using convex optimization.

4.65  Optimal  operation  of  a  hybrid  vehicle.  A  hybrid  vehicle  has  an  internal  combustion  engine, 
a  motor/generator  connected  to  a  storage  battery,  and  a  conventional  (friction)  brake.  In 
this  exercise  we  consider  a  (highly  simplified)  model  of  a  parallel  hybrid  vehicle,  in  which 
both  the  motor/generator  and  the  engine  are  directly  connected  to  the  drive  wheels.  The 
engine  can  provide  power  to  the  wheels,  and  the  brake  can  take  power  from  the  wheels, 
turning  it  into  heat.  The  motor/generator  can  act  as  a  motor,  when  it  uses  energy  stored 
in  the  battery  to  deliver  power  to  the  wheels,  or  as  a  generator,  when  it  takes  power  from 
the  wheels  or  engine,  and  uses  the  power  to  charge  the  battery.  When  the  generator  takes 
power  from  the  wheels  and  charges  the  battery,  it  is  called  regenerative  braking ;  unlike 
ordinary  friction  braking,  the  energy  taken  from  the  wheels  is  stored,  and  can  be  used 
later.  The  vehicle  is  judged  by  driving  it  over  a  known,  fixed  test  track  to  evaluate  its 
fuel  efficiency. 

A diagram illustrating the power flow in the hybrid vehicle is shown below. The arrows
indicate the direction in which the power flow is considered positive. The engine power
$p_{\mathrm{eng}}$, for example, is positive when it is delivering power; the brake power $p_{\mathrm{br}}$ is positive
when it is taking power from the wheels. The power $p_{\mathrm{req}}$ is the required power at the
wheels. It is positive when the wheels require power (e.g., when the vehicle accelerates,
climbs a hill, or cruises on level terrain). The required wheel power is negative when the
vehicle must decelerate rapidly, or descend a hill.

[Diagram: power flow between the engine, the motor/generator and battery, the brake, and the wheels.]


All of these powers are functions of time, which we discretize in one second intervals, with
$t = 1, 2, \ldots, T$. The required wheel power $p_{\mathrm{req}}(1), \ldots, p_{\mathrm{req}}(T)$ is given. (The speed of
the vehicle on the track is specified, so together with known road slope information, and
known aerodynamic and other losses, the power required at the wheels can be calculated.)
Power is conserved, which means we have

\[ p_{\mathrm{req}}(t) = p_{\mathrm{eng}}(t) + p_{\mathrm{mg}}(t) - p_{\mathrm{br}}(t), \quad t = 1, \ldots, T. \]

The brake can only dissipate power, so we have $p_{\mathrm{br}}(t) \geq 0$ for each $t$. The engine can only
provide power, and only up to a given limit $P_{\mathrm{eng}}^{\max}$, i.e., we have

\[ 0 \leq p_{\mathrm{eng}}(t) \leq P_{\mathrm{eng}}^{\max}, \quad t = 1, \ldots, T. \]

The motor/generator power is also limited: $p_{\mathrm{mg}}$ must satisfy

\[ P_{\mathrm{mg}}^{\min} \leq p_{\mathrm{mg}}(t) \leq P_{\mathrm{mg}}^{\max}, \quad t = 1, \ldots, T. \]

Here $P_{\mathrm{mg}}^{\max} > 0$ is the maximum motor power, and $-P_{\mathrm{mg}}^{\min} > 0$ is the maximum generator
power.

The battery charge or energy at time $t$ is denoted $E(t)$, $t = 1, \ldots, T + 1$. The battery
energy satisfies

\[ E(t+1) = E(t) - p_{\mathrm{mg}}(t) - \eta\,|p_{\mathrm{mg}}(t)|, \quad t = 1, \ldots, T, \]
where $\eta > 0$ is a known parameter. (The term $-p_{\mathrm{mg}}(t)$ represents the energy removed
from or added to the battery by the motor/generator, ignoring any losses. The term $-\eta\,|p_{\mathrm{mg}}(t)|$
represents energy lost through inefficiencies in the battery or motor/generator.)

The battery charge must be between 0 (empty) and its limit $E_{\mathrm{batt}}^{\max}$ (full), at all times. (If
$E(t) = 0$, the battery is fully discharged, and no more energy can be extracted from it;
when $E(t) = E_{\mathrm{batt}}^{\max}$, the battery is full and cannot be charged.) To make the comparison
with non-hybrid vehicles fair, we fix the initial battery charge to equal the final battery
charge, so the net energy change is zero over the track: $E(1) = E(T + 1)$. We do not
specify the value of the initial (and final) energy.

The objective in the problem (to be minimized) is the total fuel consumed by the engine,
which is

\[ F_{\mathrm{total}} = \sum_{t=1}^{T} F(p_{\mathrm{eng}}(t)), \]

where $F : \mathbf{R} \to \mathbf{R}$ is the fuel use characteristic of the engine. We assume that $F$ is
positive, increasing, and convex.

Formulate this problem as a convex optimization problem, with variables $p_{\mathrm{eng}}(t)$, $p_{\mathrm{mg}}(t)$,
and $p_{\mathrm{br}}(t)$ for $t = 1, \ldots, T$, and $E(t)$ for $t = 1, \ldots, T + 1$. Explain why your formulation
is equivalent to the problem described above.


Chapter  5 

Duality 


5.1  The  Lagrange  dual  function 

5.1.1  The  Lagrangian 


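We consider an optimization problem in the standard form (4.1):

\[
\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p,
\end{array}
\tag{5.1}
\]

with variable $x \in \mathbf{R}^n$. We assume its domain $\mathcal{D} = \bigcap_{i=0}^{m} \mathbf{dom}\, f_i \cap \bigcap_{i=1}^{p} \mathbf{dom}\, h_i$
is nonempty, and denote the optimal value of (5.1) by $p^\star$. We do not assume the
problem (5.1) is convex.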

The basic idea in Lagrangian duality is to take the constraints in (5.1) into
account by augmenting the objective function with a weighted sum of the constraint
functions. We define the Lagrangian $L : \mathbf{R}^n \times \mathbf{R}^m \times \mathbf{R}^p \to \mathbf{R}$ associated with the
problem (5.1) as

\[ L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x), \]

with $\mathbf{dom}\, L = \mathcal{D} \times \mathbf{R}^m \times \mathbf{R}^p$. We refer to $\lambda_i$ as the Lagrange multiplier associated
with the $i$th inequality constraint $f_i(x) \leq 0$; similarly we refer to $\nu_i$ as the Lagrange
multiplier associated with the $i$th equality constraint $h_i(x) = 0$. The vectors $\lambda$ and
$\nu$ are called the dual variables or Lagrange multiplier vectors associated with the
problem (5.1).
5.1.2 The Lagrange dual function

We define the Lagrange dual function (or just dual function) $g : \mathbf{R}^m \times \mathbf{R}^p \to \mathbf{R}$ as
the minimum value of the Lagrangian over $x$: for $\lambda \in \mathbf{R}^m$, $\nu \in \mathbf{R}^p$,

\[ g(\lambda, \nu) = \inf_{x \in \mathcal{D}} L(x, \lambda, \nu) = \inf_{x \in \mathcal{D}} \left( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \right). \]

When the Lagrangian is unbounded below in $x$, the dual function takes on the
value $-\infty$. Since the dual function is the pointwise infimum of a family of affine
functions of $(\lambda, \nu)$, it is concave, even when the problem (5.1) is not convex.

5.1.3  Lower  bounds  on  optimal  value 

The dual function yields lower bounds on the optimal value $p^\star$ of the problem (5.1):
For any $\lambda \succeq 0$ and any $\nu$ we have

\[ g(\lambda, \nu) \leq p^\star. \tag{5.2} \]

This important property is easily verified. Suppose $\tilde{x}$ is a feasible point for the
problem (5.1), i.e., $f_i(\tilde{x}) \leq 0$ and $h_i(\tilde{x}) = 0$, and $\lambda \succeq 0$. Then we have

\[ \sum_{i=1}^{m} \lambda_i f_i(\tilde{x}) + \sum_{i=1}^{p} \nu_i h_i(\tilde{x}) \leq 0, \]

since each term in the first sum is nonpositive, and each term in the second sum is
zero, and therefore

\[ L(\tilde{x}, \lambda, \nu) = f_0(\tilde{x}) + \sum_{i=1}^{m} \lambda_i f_i(\tilde{x}) + \sum_{i=1}^{p} \nu_i h_i(\tilde{x}) \leq f_0(\tilde{x}). \]

Hence

\[ g(\lambda, \nu) = \inf_{x \in \mathcal{D}} L(x, \lambda, \nu) \leq L(\tilde{x}, \lambda, \nu) \leq f_0(\tilde{x}). \]

Since $g(\lambda, \nu) \leq f_0(\tilde{x})$ holds for every feasible point $\tilde{x}$, the inequality (5.2) follows.
The lower bound (5.2) is illustrated in figure 5.1, for a simple problem with $x \in \mathbf{R}$
and one inequality constraint.

The inequality (5.2) holds, but is vacuous, when $g(\lambda, \nu) = -\infty$. The dual
function gives a nontrivial lower bound on $p^\star$ only when $\lambda \succeq 0$ and $(\lambda, \nu) \in \mathbf{dom}\, g$,
i.e., $g(\lambda, \nu) > -\infty$. We refer to a pair $(\lambda, \nu)$ with $\lambda \succeq 0$ and $(\lambda, \nu) \in \mathbf{dom}\, g$ as dual
feasible, for reasons that will become clear later.
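
As a small numerical illustration of (5.2), the sketch below uses a toy problem that is not taken from the text: minimize $x^2 + 1$ subject to $(x-2)(x-4) \leq 0$. Its feasible set is $[2, 4]$ and its optimal value is $p^\star = 5$, attained at $x = 2$. The function names are our own. For $\lambda \geq 0$ the Lagrangian is a convex quadratic in $x$, so its infimum can be written in closed form.

```python
def lagrangian(x, lam):
    """L(x, lam) = f0(x) + lam * f1(x) for the toy problem."""
    return x**2 + 1 + lam * (x - 2) * (x - 4)


def dual_function(lam):
    """g(lam) = inf_x L(x, lam), valid for lam >= 0.

    L(x, lam) = (1 + lam) x^2 - 6 lam x + (8 lam + 1) is a convex quadratic
    in x, minimized at x = 3 lam / (1 + lam).
    """
    x_min = 3 * lam / (1 + lam)
    return lagrangian(x_min, lam)


p_star = 5.0  # attained at x = 2

for lam in [0.0, 0.5, 1.0, 2.0, 5.0]:
    g = dual_function(lam)
    # Lower bound property (5.2): g(lam) <= p_star for every lam >= 0.
    print(f"lam = {lam:4.1f}   g(lam) = {g:7.4f}   bound holds: {g <= p_star + 1e-12}")
```

For this particular toy problem the bound happens to be tight at $\lambda = 2$, where the minimizer of the Lagrangian is the primal optimal point $x = 2$.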

5.1.4  Linear  approximation  interpretation 

The Lagrangian and lower bound property can be given a simple interpretation,
based on a linear approximation of the indicator functions of the sets $\{0\}$ and $-\mathbf{R}_+$.
Figure 5.1 Lower bound from a dual feasible point. The solid curve shows the
objective function $f_0$, and the dashed curve shows the constraint function $f_1$.
The feasible set is the interval $[-0.46, 0.46]$, which is indicated by the two
dotted vertical lines. The optimal point and value are $x^\star = -0.46$, $p^\star = 1.54$
(shown as a circle). The dotted curves show $L(x, \lambda)$ for $\lambda = 0.1, 0.2, \ldots, 1.0$.
Each of these has a minimum value smaller than $p^\star$, since on the feasible set
(and for $\lambda \geq 0$) we have $L(x, \lambda) \leq f_0(x)$.


Figure 5.2 The dual function $g$ for the problem in figure 5.1. Neither $f_0$ nor
$f_1$ is convex, but the dual function is concave. The horizontal dashed line
shows $p^\star$, the optimal value of the problem.
We first rewrite the original problem (5.1) as an unconstrained problem,

\[ \text{minimize} \quad f_0(x) + \sum_{i=1}^{m} I_-(f_i(x)) + \sum_{i=1}^{p} I_0(h_i(x)), \tag{5.3} \]

where $I_- : \mathbf{R} \to \mathbf{R}$ is the indicator function for the nonpositive reals,

\[ I_-(u) = \begin{cases} 0 & u \leq 0 \\ \infty & u > 0, \end{cases} \]

and similarly, $I_0$ is the indicator function of $\{0\}$. In the formulation (5.3), the
function $I_-(u)$ can be interpreted as expressing our irritation or displeasure associated
with a constraint function value $u = f_i(x)$: It is zero if $f_i(x) \leq 0$, and infinite if
$f_i(x) > 0$. In a similar way, $I_0(u)$ gives our displeasure for an equality constraint
value $u = h_i(x)$. We can think of $I_-$ as a “brick wall” or “infinitely hard” displeasure
function; our displeasure rises from zero to infinite as $f_i(x)$ transitions from
nonpositive to positive.

Now suppose in the formulation (5.3) we replace the function $I_-(u)$ with the
linear function $\lambda_i u$, where $\lambda_i \geq 0$, and the function $I_0(u)$ with $\nu_i u$. The objective
becomes the Lagrangian function $L(x, \lambda, \nu)$, and the dual function value $g(\lambda, \nu)$ is
the optimal value of the problem

\[ \text{minimize} \quad L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x). \tag{5.4} \]

In this formulation, we use a linear or “soft” displeasure function in place of $I_-$
and $I_0$. For an inequality constraint, our displeasure is zero when $f_i(x) = 0$, and is
positive when $f_i(x) > 0$ (assuming $\lambda_i > 0$); our displeasure grows as the constraint
becomes “more violated”. Unlike the original formulation, in which any nonpositive
value of $f_i(x)$ is acceptable, in the soft formulation we actually derive pleasure from
constraints that have margin, i.e., from $f_i(x) < 0$.

Clearly the approximation of the indicator function $I_-(u)$ with a linear function
$\lambda_i u$ is rather poor. But the linear function is at least an underestimator of the
indicator function. Since $\lambda_i u \leq I_-(u)$ and $\nu_i u \leq I_0(u)$ for all $u$, we see immediately
that the dual function yields a lower bound on the optimal value of the original
problem.
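
The underestimator property can be checked on a grid of values of $u$. The following is a minimal sketch; the function names and the particular values of $\lambda_i$ and $\nu_i$ are arbitrary choices made for illustration.

```python
import math


def indicator_nonpos(u):
    """I_-(u): 0 for u <= 0, +infinity for u > 0."""
    return 0.0 if u <= 0 else math.inf


def indicator_zero(u):
    """I_0(u): 0 for u == 0, +infinity otherwise."""
    return 0.0 if u == 0 else math.inf


lam, nu = 0.7, -1.3                      # any lam >= 0 and any real nu
grid = [k / 10 for k in range(-30, 31)]  # values of u in [-3, 3], including 0

# The linear "soft" penalties never exceed the "hard" indicator penalties.
soft_below_hard_ineq = all(lam * u <= indicator_nonpos(u) for u in grid)
soft_below_hard_eq = all(nu * u <= indicator_zero(u) for u in grid)
print(soft_below_hard_ineq, soft_below_hard_eq)

# With a negative multiplier the inequality fails for u < 0,
# which is why the lower bound requires lam >= 0.
print(all(-1.0 * u <= indicator_nonpos(u) for u in grid))
```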

The  idea  of  replacing  the  “hard”  constraints  with  “soft”  versions  will  come  up 
again  when  we  consider  interior-point  methods  (§11.2.1). 


5.1.5  Examples 

In this section we give some examples for which we can derive an analytical
expression for the Lagrange dual function.

Least-squares  solution  of  linear  equations 

We consider the problem

\[
\begin{array}{ll}
\text{minimize} & x^T x \\
\text{subject to} & Ax = b,
\end{array}
\tag{5.5}
\]

where $A \in \mathbf{R}^{p \times n}$. This problem has no inequality constraints and $p$ (linear) equality
constraints. The Lagrangian is $L(x, \nu) = x^T x + \nu^T(Ax - b)$, with domain $\mathbf{R}^n \times \mathbf{R}^p$.
The dual function is given by $g(\nu) = \inf_x L(x, \nu)$. Since $L(x, \nu)$ is a convex
quadratic function of $x$, we can find the minimizing $x$ from the optimality condition

\[ \nabla_x L(x, \nu) = 2x + A^T \nu = 0, \]

which yields $x = -(1/2)A^T \nu$. Therefore the dual function is

\[ g(\nu) = L(-(1/2)A^T \nu, \nu) = -(1/4)\nu^T A A^T \nu - b^T \nu, \]

which is a concave quadratic function, with domain $\mathbf{R}^p$. The lower bound
property (5.2) states that for any $\nu \in \mathbf{R}^p$, we have

\[ -(1/4)\nu^T A A^T \nu - b^T \nu \leq \inf\{ x^T x \mid Ax = b \}. \]
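
This bound is easy to check numerically. The sketch below is a minimal illustration using NumPy on randomly generated data (the sizes, the seed, and the variable names are assumptions made for the example). It uses the standard fact that, when $A$ has full row rank, the minimum-norm solution of $Ax = b$ is $A^T(AA^T)^{-1}b$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 6
A = rng.standard_normal((p, n))   # full row rank with probability one
b = rng.standard_normal(p)


def dual_function(nu):
    """g(nu) = -(1/4) nu^T A A^T nu - b^T nu."""
    return -0.25 * nu @ (A @ A.T) @ nu - b @ nu


# Optimal value of the primal problem: squared norm of the minimum-norm solution.
x_star = A.T @ np.linalg.solve(A @ A.T, b)
p_star = x_star @ x_star

# Every nu gives a lower bound on p_star.
for _ in range(5):
    nu = rng.standard_normal(p)
    print(f"g(nu) = {dual_function(nu):9.4f}  <=  p* = {p_star:.4f}")
```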


Standard  form  LP 

Consider an LP in standard form,

\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax = b \\
& x \succeq 0,
\end{array}
\tag{5.6}
\]

which has inequality constraint functions $f_i(x) = -x_i$, $i = 1, \ldots, n$. To form
the Lagrangian we introduce multipliers $\lambda_i$ for the $n$ inequality constraints and
multipliers $\nu_i$ for the equality constraints, and obtain

\[ L(x, \lambda, \nu) = c^T x - \sum_{i=1}^{n} \lambda_i x_i + \nu^T(Ax - b) = -b^T \nu + (c + A^T \nu - \lambda)^T x. \]

The dual function is

\[ g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) = -b^T \nu + \inf_x \, (c + A^T \nu - \lambda)^T x, \]

which is easily determined analytically, since a linear function is bounded below
only when it is identically zero. Thus, $g(\lambda, \nu) = -\infty$ except when $c + A^T \nu - \lambda = 0$,
in which case it is $-b^T \nu$:

\[ g(\lambda, \nu) = \begin{cases} -b^T \nu & A^T \nu - \lambda + c = 0 \\ -\infty & \text{otherwise.} \end{cases} \]

Note that the dual function $g$ is finite only on a proper affine subset of $\mathbf{R}^m \times \mathbf{R}^p$.
We will see that this is a common occurrence.

The lower bound property (5.2) is nontrivial only when $\lambda$ and $\nu$ satisfy $\lambda \succeq 0$
and $A^T \nu - \lambda + c = 0$. When this occurs, $-b^T \nu$ is a lower bound on the optimal
value of the LP (5.6).
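
The piecewise form of this dual function translates directly into code. The following sketch, in plain Python, evaluates $g(\lambda, \nu)$ for a tiny LP chosen for illustration (the data, the function name, and the tolerance are assumptions, not part of the text): minimize $x_1 + 2x_2$ subject to $x_1 + x_2 = 1$, $x \succeq 0$, which has optimal value $p^\star = 1$ at $x = (1, 0)$.

```python
import math


def lp_dual_function(c, A, b, lam, nu, tol=1e-9):
    """Dual function of: minimize c^T x  s.t.  A x = b, x >= 0.

    A is a list of p rows of length n; lam has length n; nu has length p.
    Returns -b^T nu when A^T nu - lam + c = 0 (up to tol), else -infinity.
    """
    n, p = len(c), len(b)
    residual = [sum(A[j][i] * nu[j] for j in range(p)) - lam[i] + c[i]
                for i in range(n)]
    if all(abs(r) <= tol for r in residual):
        return -sum(b[j] * nu[j] for j in range(p))
    return -math.inf


c, A, b = [1.0, 2.0], [[1.0, 1.0]], [1.0]
p_star = 1.0

print(lp_dual_function(c, A, b, lam=[0.0, 1.0], nu=[-1.0]))  # dual feasible point
print(lp_dual_function(c, A, b, lam=[1.0, 2.0], nu=[0.0]))   # another dual feasible point
print(lp_dual_function(c, A, b, lam=[0.5, 0.5], nu=[0.0]))   # off the affine set
```

Both dual feasible points give values that do not exceed $p^\star$, while a point off the affine set $A^T\nu - \lambda + c = 0$ gives the vacuous bound $-\infty$.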


Two-way  partitioning  problem 

We consider the (nonconvex) problem

\[
\begin{array}{ll}
\text{minimize} & x^T W x \\
\text{subject to} & x_i^2 = 1, \quad i = 1, \ldots, n,
\end{array}
\tag{5.7}
\]

where $W \in \mathbf{S}^n$. The constraints restrict the values of $x_i$ to $1$ or $-1$, so the problem
is equivalent to finding the vector with components $\pm 1$ that minimizes $x^T W x$. The
feasible set here is finite (it contains $2^n$ points) so this problem can in principle
be solved by simply checking the objective value of each feasible point. Since the
number of feasible points grows exponentially, however, this is possible only for
small problems (say, with $n \leq 30$). In general (and for $n$ larger than, say, 50) the
problem (5.7) is very difficult to solve.

We can interpret the problem (5.7) as a two-way partitioning problem on a set
of $n$ elements, say, $\{1, \ldots, n\}$: A feasible $x$ corresponds to the partition

\[ \{1, \ldots, n\} = \{ i \mid x_i = -1 \} \cup \{ i \mid x_i = 1 \}. \]

The matrix coefficient $W_{ij}$ can be interpreted as the cost of having the elements $i$
and $j$ in the same partition, and $-W_{ij}$ is the cost of having $i$ and $j$ in different
partitions. The objective in (5.7) is the total cost, over all pairs of elements, and
the problem (5.7) is to find the partition with least total cost.

We now derive the dual function for this problem. The Lagrangian is

\[
L(x, \nu) = x^T W x + \sum_{i=1}^{n} \nu_i (x_i^2 - 1)
          = x^T (W + \mathbf{diag}(\nu)) x - \mathbf{1}^T \nu.
\]

We obtain the Lagrange dual function by minimizing over $x$:

\[
g(\nu) = \inf_x \, x^T (W + \mathbf{diag}(\nu)) x - \mathbf{1}^T \nu
       = \begin{cases} -\mathbf{1}^T \nu & W + \mathbf{diag}(\nu) \succeq 0 \\ -\infty & \text{otherwise,} \end{cases}
\]

where we use the fact that the infimum of a quadratic form is either zero (if the
form is positive semidefinite) or $-\infty$ (if the form is not positive semidefinite).

This dual function provides lower bounds on the optimal value of the difficult
problem (5.7). For example, we can take the specific value of the dual variable

\[ \nu = -\lambda_{\min}(W) \mathbf{1}, \]

which is dual feasible, since

\[ W + \mathbf{diag}(\nu) = W - \lambda_{\min}(W) I \succeq 0. \]

This yields the bound on the optimal value $p^\star$

\[ p^\star \geq -\mathbf{1}^T \nu = n \lambda_{\min}(W). \tag{5.8} \]
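
For small $n$ the bound (5.8) can be compared with the exact optimal value found by enumeration. The sketch below does this with NumPy on a random symmetric matrix; the size $n = 8$, the seed, and the variable names are arbitrary choices for the example.

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
n = 8
M = rng.standard_normal((n, n))
W = (M + M.T) / 2                      # random symmetric matrix

# Brute force over all 2^n sign vectors (practical only for small n).
p_star = min(np.array(x) @ W @ np.array(x)
             for x in itertools.product([-1.0, 1.0], repeat=n))

lam_min = np.linalg.eigvalsh(W)[0]     # eigvalsh returns eigenvalues in ascending order
bound = n * lam_min                    # the lower bound (5.8)

print(f"p* = {p_star:.4f},  n * lambda_min(W) = {bound:.4f}")
```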


Remark 5.1 This lower bound on $p^\star$ can also be obtained without using the Lagrange
dual function. First, we replace the constraints $x_1^2 = 1, \ldots, x_n^2 = 1$ with $\sum_{i=1}^{n} x_i^2 = n$,
to obtain the modified problem

\[
\begin{array}{ll}
\text{minimize} & x^T W x \\
\text{subject to} & \sum_{i=1}^{n} x_i^2 = n.
\end{array}
\tag{5.9}
\]

The constraints of the original problem (5.7) imply the constraint here, so the optimal
value of the problem (5.9) is a lower bound on $p^\star$, the optimal value of (5.7). But the
modified problem (5.9) is easily solved as an eigenvalue problem, with optimal value
$n \lambda_{\min}(W)$.


5.1.6  The  Lagrange  dual  function  and  conjugate  functions 

Recall from §3.3 that the conjugate $f^*$ of a function $f : \mathbf{R}^n \to \mathbf{R}$ is given by

\[ f^*(y) = \sup_{x \in \mathbf{dom}\, f} \left( y^T x - f(x) \right). \]

The conjugate function and Lagrange dual function are closely related. To see one
simple connection, consider the problem

\[
\begin{array}{ll}
\text{minimize} & f(x) \\
\text{subject to} & x = 0
\end{array}
\]

(which is not very interesting, and solvable by inspection). This problem has
Lagrangian $L(x, \nu) = f(x) + \nu^T x$, and dual function

\[ g(\nu) = \inf_x \left( f(x) + \nu^T x \right) = -\sup_x \left( (-\nu)^T x - f(x) \right) = -f^*(-\nu). \]

More generally (and more usefully), consider an optimization problem with
linear inequality and equality constraints,

\[
\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & Ax \preceq b \\
& Cx = d.
\end{array}
\tag{5.10}
\]

Using the conjugate of $f_0$ we can write the dual function for the problem (5.10) as

\[
\begin{aligned}
g(\lambda, \nu) &= \inf_x \left( f_0(x) + \lambda^T (Ax - b) + \nu^T (Cx - d) \right) \\
 &= -b^T \lambda - d^T \nu + \inf_x \left( f_0(x) + (A^T \lambda + C^T \nu)^T x \right) \\
 &= -b^T \lambda - d^T \nu - f_0^*(-A^T \lambda - C^T \nu).
\end{aligned}
\tag{5.11}
\]

The domain of $g$ follows from the domain of $f_0^*$:

\[ \mathbf{dom}\, g = \{ (\lambda, \nu) \mid -A^T \lambda - C^T \nu \in \mathbf{dom}\, f_0^* \}. \]

Let us illustrate this with a few examples.

Equality  constrained  norm  minimization 

Consider the problem

\[
\begin{array}{ll}
\text{minimize} & \|x\| \\
\text{subject to} & Ax = b,
\end{array}
\tag{5.12}
\]

where $\|\cdot\|$ is any norm. Recall (from example 3.26 on page 93) that the conjugate
of $f_0 = \|\cdot\|$ is given by

\[ f_0^*(y) = \begin{cases} 0 & \|y\|_* \leq 1 \\ \infty & \text{otherwise,} \end{cases} \]

the indicator function of the dual norm unit ball.

Using the result (5.11) above, the dual function for the problem (5.12) is given
by

\[ g(\nu) = -b^T \nu - f_0^*(-A^T \nu) = \begin{cases} -b^T \nu & \|A^T \nu\|_* \leq 1 \\ -\infty & \text{otherwise.} \end{cases} \]

Entropy  maximization 

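Consider the entropy maximization problem

\[
\begin{array}{ll}
\text{minimize} & f_0(x) = \sum_{i=1}^{n} x_i \log x_i \\
\text{subject to} & Ax \preceq b \\
& \mathbf{1}^T x = 1
\end{array}
\tag{5.13}
\]

where $\mathbf{dom}\, f_0 = \mathbf{R}^n_{++}$. The conjugate of the negative entropy function $u \log u$,
with scalar variable $u$, is $e^{v-1}$ (see example 3.21 on page 91). Since $f_0$ is a sum of
negative entropy functions of different variables, we conclude that its conjugate is

\[ f_0^*(y) = \sum_{i=1}^{n} e^{y_i - 1}, \]

with $\mathbf{dom}\, f_0^* = \mathbf{R}^n$. Using the result (5.11) above, the dual function of (5.13) is
given by

\[ g(\lambda, \nu) = -b^T \lambda - \nu - \sum_{i=1}^{n} e^{-a_i^T \lambda - \nu - 1} = -b^T \lambda - \nu - e^{-\nu - 1} \sum_{i=1}^{n} e^{-a_i^T \lambda}, \]

where $a_i$ is the $i$th column of $A$.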
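
As a sanity check of this expression, the sketch below evaluates the dual function at random dual feasible points and compares it with the objective at a known feasible point, which by the lower bound property can never be smaller. The data are synthetic and the names are our own; $b$ is constructed so that the uniform distribution is feasible.

```python
import numpy as np


def entropy_dual(A, b, lam, nu):
    """g(lam, nu) = -b^T lam - nu - exp(-nu - 1) * sum_i exp(-a_i^T lam)."""
    return -b @ lam - nu - np.exp(-nu - 1.0) * np.sum(np.exp(-A.T @ lam))


rng = np.random.default_rng(2)
m, n = 3, 5
A = rng.standard_normal((m, n))
x_feas = np.full(n, 1.0 / n)           # uniform distribution, so 1^T x = 1
b = A @ x_feas + 0.1                   # chosen so that A x_feas <= b holds

f0 = np.sum(x_feas * np.log(x_feas))   # objective value at the feasible point

for _ in range(5):
    lam = rng.random(m)                # componentwise nonnegative
    nu = rng.standard_normal()
    print(f"g = {entropy_dual(A, b, lam, nu):9.4f}  <=  f0(x_feas) = {f0:.4f}")
```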

Minimum  volume  covering  ellipsoid 

Consider the problem with variable $X \in \mathbf{S}^n$,

\[
\begin{array}{ll}
\text{minimize} & f_0(X) = \log \det X^{-1} \\
\text{subject to} & a_i^T X a_i \leq 1, \quad i = 1, \ldots, m,
\end{array}
\tag{5.14}
\]

where $\mathbf{dom}\, f_0 = \mathbf{S}^n_{++}$. The problem (5.14) has a simple geometric interpretation.
With each $X \in \mathbf{S}^n_{++}$ we associate the ellipsoid, centered at the origin,

\[ \mathcal{E}_X = \{ z \mid z^T X z \leq 1 \}. \]

The volume of this ellipsoid is proportional to $(\det X^{-1})^{1/2}$, so the objective
of (5.14) is, except for a constant and a factor of two, the logarithm of the volume
of $\mathcal{E}_X$. The constraints of the problem (5.14) are that $a_i \in \mathcal{E}_X$. Thus the
problem (5.14) is to determine the minimum volume ellipsoid, centered at the origin,
that includes the points $a_1, \ldots, a_m$.

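The inequality constraints in problem (5.14) are affine; they can be expressed
as

\[ \mathbf{tr}\left( (a_i a_i^T) X \right) \leq 1. \]

In example 3.23 (page 92) we found that the conjugate of $f_0$ is

\[ f_0^*(Y) = \log \det (-Y)^{-1} - n, \]

with $\mathbf{dom}\, f_0^* = -\mathbf{S}^n_{++}$. Applying the result (5.11) above, the dual function for the
problem (5.14) is given by

\[
g(\lambda) = \begin{cases}
\log \det \left( \sum_{i=1}^{m} \lambda_i a_i a_i^T \right) - \mathbf{1}^T \lambda + n & \sum_{i=1}^{m} \lambda_i a_i a_i^T \succ 0 \\
-\infty & \text{otherwise.}
\end{cases}
\tag{5.15}
\]

Thus, for any $\lambda \succeq 0$ with $\sum_{i=1}^{m} \lambda_i a_i a_i^T \succ 0$, the number

\[ \log \det \left( \sum_{i=1}^{m} \lambda_i a_i a_i^T \right) - \mathbf{1}^T \lambda + n \]

is a lower bound on the optimal value of the problem (5.14).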
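
To make the bound concrete, the sketch below compares it with the objective value at one easily constructed feasible point: the ball $X = I / \max_i \|a_i\|_2^2$, which contains every $a_i$. The points are random, and the particular multiplier vector $\lambda = (n/m)\mathbf{1}$ is just one hypothetical choice; nothing here claims it is the best one.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 2, 6
a = rng.standard_normal((m, n))        # rows are the points a_1, ..., a_m


def dual_bound(lam):
    """log det(sum_i lam_i a_i a_i^T) - 1^T lam + n, or -inf if not positive definite."""
    S = (a * lam[:, None]).T @ a       # sum_i lam_i a_i a_i^T
    if np.linalg.eigvalsh(S)[0] <= 0:
        return -np.inf
    _, logdet = np.linalg.slogdet(S)
    return logdet - lam.sum() + n


# A feasible X: a ball containing every point, X = I / max_i ||a_i||^2.
r2 = np.max(np.sum(a**2, axis=1))
f0_feasible = n * np.log(r2)           # log det X^{-1} for X = I / r2

lam = np.full(m, n / m)                # one hypothetical choice of lam >= 0
print(f"dual bound = {dual_bound(lam):.4f}  <=  f0(X_feasible) = {f0_feasible:.4f}")
```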


5.2  The  Lagrange  dual  problem 

For each pair $(\lambda, \nu)$ with $\lambda \succeq 0$, the Lagrange dual function gives us a lower bound
on the optimal value $p^\star$ of the optimization problem (5.1). Thus we have a lower
bound that depends on some parameters $\lambda$, $\nu$. A natural question is: What is the
best lower bound that can be obtained from the Lagrange dual function?

This leads to the optimization problem

\[
\begin{array}{ll}
\text{maximize} & g(\lambda, \nu) \\
\text{subject to} & \lambda \succeq 0.
\end{array}
\tag{5.16}
\]

This problem is called the Lagrange dual problem associated with the problem (5.1).
In this context the original problem (5.1) is sometimes called the primal problem.
The term dual feasible, to describe a pair $(\lambda, \nu)$ with $\lambda \succeq 0$ and $g(\lambda, \nu) > -\infty$,
now makes sense. It means, as the name implies, that $(\lambda, \nu)$ is feasible for the dual
problem (5.16). We refer to $(\lambda^\star, \nu^\star)$ as dual optimal or optimal Lagrange multipliers
if they are optimal for the problem (5.16).

The Lagrange dual problem (5.16) is a convex optimization problem, since the
objective to be maximized is concave and the constraint is convex. This is the case
whether or not the primal problem (5.1) is convex.
5.2.1 Making dual constraints explicit

The examples above show that it is not uncommon for the domain of the dual
function,

\[ \mathbf{dom}\, g = \{ (\lambda, \nu) \mid g(\lambda, \nu) > -\infty \}, \]

to have dimension smaller than $m + p$. In many cases we can identify the affine
hull of $\mathbf{dom}\, g$, and describe it as a set of linear equality constraints. Roughly
speaking, this means we can identify the equality constraints that are ‘hidden’ or
‘implicit’ in the objective $g$ of the dual problem (5.16). In this case we can form
an equivalent problem, in which these equality constraints are given explicitly as
constraints. The following examples demonstrate this idea.


Lagrange  dual  of  standard  form  LP 

On page 219 we found that the Lagrange dual function for the standard form LP

\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax = b \\
& x \succeq 0
\end{array}
\tag{5.17}
\]

is given by

\[ g(\lambda, \nu) = \begin{cases} -b^T \nu & A^T \nu - \lambda + c = 0 \\ -\infty & \text{otherwise.} \end{cases} \]

Strictly speaking, the Lagrange dual problem of the standard form LP is to maximize
this dual function $g$ subject to $\lambda \succeq 0$, i.e.,

\[
\begin{array}{ll}
\text{maximize} & g(\lambda, \nu) = \begin{cases} -b^T \nu & A^T \nu - \lambda + c = 0 \\ -\infty & \text{otherwise} \end{cases} \\
\text{subject to} & \lambda \succeq 0.
\end{array}
\tag{5.18}
\]

Here $g$ is finite only when $A^T \nu - \lambda + c = 0$. We can form an equivalent problem
by making these equality constraints explicit:

\[
\begin{array}{ll}
\text{maximize} & -b^T \nu \\
\text{subject to} & A^T \nu - \lambda + c = 0 \\
& \lambda \succeq 0.
\end{array}
\tag{5.19}
\]

This problem, in turn, can be expressed as

\[
\begin{array}{ll}
\text{maximize} & -b^T \nu \\
\text{subject to} & A^T \nu + c \succeq 0,
\end{array}
\tag{5.20}
\]

which is an LP in inequality form.

Note the subtle distinctions between these three problems. The Lagrange dual
of the standard form LP (5.17) is the problem (5.18), which is equivalent to (but
not the same as) the problems (5.19) and (5.20). With some abuse of terminology,
we refer to the problem (5.19) or the problem (5.20) as the Lagrange dual of the
standard form LP (5.17).
Lagrange dual of inequality form LP

In a similar way we can find the Lagrange dual problem of a linear program in
inequality form

\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax \preceq b.
\end{array}
\tag{5.21}
\]

The Lagrangian is

\[ L(x, \lambda) = c^T x + \lambda^T (Ax - b) = -b^T \lambda + (A^T \lambda + c)^T x, \]

so the dual function is

\[ g(\lambda) = \inf_x L(x, \lambda) = -b^T \lambda + \inf_x \, (A^T \lambda + c)^T x. \]

The infimum of a linear function is $-\infty$, except in the special case when it is
identically zero, so the dual function is

\[ g(\lambda) = \begin{cases} -b^T \lambda & A^T \lambda + c = 0 \\ -\infty & \text{otherwise.} \end{cases} \]

The dual variable $\lambda$ is dual feasible if $\lambda \succeq 0$ and $A^T \lambda + c = 0$.

The Lagrange dual of the LP (5.21) is to maximize $g$ over all $\lambda \succeq 0$. Again
we can reformulate this by explicitly including the dual feasibility conditions as
constraints, as in

\[
\begin{array}{ll}
\text{maximize} & -b^T \lambda \\
\text{subject to} & A^T \lambda + c = 0 \\
& \lambda \succeq 0,
\end{array}
\tag{5.22}
\]

which is an LP in standard form.

Note the interesting symmetry between the standard and inequality form LPs
and their duals: The dual of a standard form LP is an LP with only inequality
constraints, and vice versa. One can also verify that the Lagrange dual of (5.22) is
(equivalent to) the primal problem (5.21).
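
The relationship between an inequality form LP and its dual (5.22) can be seen on a small instance worked by hand. The sketch below, in plain Python, uses a toy LP chosen for illustration (not from the text): minimize $-x_1 - 2x_2$ subject to $x_1 + x_2 \leq 4$, $x_1 \leq 3$, $x_2 \leq 2$, whose optimal value is $p^\star = -6$ at $x = (2, 2)$. For this instance the dual feasible set is the segment $\lambda = (t, 1 - t, 2 - t)$, $0 \leq t \leq 1$. The helper names are hypothetical.

```python
def dual_objective(b, lam):
    """Objective -b^T lam of the dual LP (5.22)."""
    return -sum(bi * li for bi, li in zip(b, lam))


def is_dual_feasible(A, c, lam, tol=1e-9):
    """Check A^T lam + c = 0 and lam >= 0 (up to tol)."""
    n = len(c)
    residual = [sum(A[i][j] * lam[i] for i in range(len(A))) + c[j]
                for j in range(n)]
    return all(l >= -tol for l in lam) and all(abs(r) <= tol for r in residual)


A = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
b = [4.0, 3.0, 2.0]
c = [-1.0, -2.0]
p_star = -6.0                          # attained at x = (2, 2)

# Walk along the dual feasible segment lam = (t, 1 - t, 2 - t), 0 <= t <= 1.
for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    lam = [t, 1.0 - t, 2.0 - t]
    assert is_dual_feasible(A, c, lam)
    print(f"t = {t:4.2f}   -b^T lam = {dual_objective(b, lam):6.2f}   p* = {p_star}")
```

Every dual feasible point gives a value no larger than $p^\star$, and the endpoint $t = 1$, i.e., $\lambda = (1, 0, 1)$, attains it.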


5.2.2  Weak  duality 

The optimal value of the Lagrange dual problem, which we denote $d^\star$, is, by
definition, the best lower bound on $p^\star$ that can be obtained from the Lagrange dual
function. In particular, we have the simple but important inequality

\[ d^\star \leq p^\star, \tag{5.23} \]

which holds even if the original problem is not convex. This property is called weak
duality.

The weak duality inequality (5.23) holds when $d^\star$ and $p^\star$ are infinite. For
example, if the primal problem is unbounded below, so that $p^\star = -\infty$, we must
have $d^\star = -\infty$, i.e., the Lagrange dual problem is infeasible. Conversely, if the
dual problem is unbounded above, so that $d^\star = \infty$, we must have $p^\star = \infty$, i.e., the
primal problem is infeasible.

We refer to the difference $p^\star - d^\star$ as the optimal duality gap of the original
problem, since it gives the gap between the optimal value of the primal problem
and the best (i.e., greatest) lower bound on it that can be obtained from the
Lagrange dual function. The optimal duality gap is always nonnegative.

The bound (5.23) can sometimes be used to find a lower bound on the optimal
value of a problem that is difficult to solve, since the dual problem is always convex,
and in many cases can be solved efficiently, to find $d^\star$. As an example, consider
the two-way partitioning problem (5.7) described on page 219. The dual problem
is an SDP,

\[
\begin{array}{ll}
\text{maximize} & -\mathbf{1}^T \nu \\
\text{subject to} & W + \mathbf{diag}(\nu) \succeq 0,
\end{array}
\]

with variable $\nu \in \mathbf{R}^n$. This problem can be solved efficiently, even for relatively
large values of $n$, such as $n = 1000$. Its optimal value is a lower bound on the
optimal value of the two-way partitioning problem, and is always at least as good
as the lower bound (5.8) based on $\lambda_{\min}(W)$.


5.2.3  Strong  duality  and  Slater’s  constraint  qualification 

If  the  equality 

d *  =  p*  (5.24) 

holds,  i.e.,  the  optimal  duality  gap  is  zero,  then  we  say  that  strong  duality  holds. 
This  means  that  the  best  bound  that  can  be  obtained  from  the  Lagrange  dual 
function  is  tight. 

Strong  duality  does  not,  in  general,  hold.  But  if  the  primal  problem  (5.1)  is 
convex,  i.e.,  of  the  form 

minimize  fo(%) 

subject  to  fi(x)  <0,  i  =  1, . . . ,  m,  (5.25) 

Ax  =  b, 

with  /o , ... ,  fm  convex,  we  usually  (but  not  always)  have  strong  duality.  There  are 
many  results  that  establish  conditions  on  the  problem,  beyond  convexity,  under 
which  strong  duality  holds.  These  conditions  are  called  constraint  qualifications. 

One  simple  constraint  qualification  is  Slater’s  condition:  There  exists  an  x  £ 
relint  V  such  that 

fi(x)  <0,  i  =  1, . . . ,  m,  Ax  =  b.  (5.26) 

Such  a  point  is  sometimes  called  strictly  feasible,  since  the  inequality  constraints 
hold  with  strict  inequalities.  Slater’s  theorem  states  that  strong  duality  holds,  if 
Slater’s  condition  holds  (and  the  problem  is  convex). 

Slater’s  condition  can  be  refined  when  some  of  the  inequality  constraint  func¬ 
tions  fi  are  affine.  If  the  first  k  constraint  functions  f\, ... ,  ff  are  affine,  then 
strong  duality  holds  provided  the  following  weaker  condition  holds:  There  exists 
an  x  £  relint  V  with 


fi(x)  <  0,  i  =  l,...,k, 


fi(x)  <0,  i  =  k  +  1, . . .  ,m, 


Ax  =  b.  (5.27) 
In other words, the affine inequalities do not need to hold with strict inequality.
Note that the refined Slater condition (5.27) reduces to feasibility when the
constraints are all linear equalities and inequalities, and $\operatorname{dom} f_0$ is open.

Slater’s  condition  (and  the  refinement  (5.27))  not  only  implies  strong  duality 
for convex problems. It also implies that the dual optimal value is attained when
$d^\star > -\infty$, i.e., there exists a dual feasible $(\lambda^\star, \nu^\star)$ with $g(\lambda^\star, \nu^\star) = d^\star = p^\star$. We
will  prove  that  strong  duality  obtains,  when  the  primal  problem  is  convex  and 
Slater’s  condition  holds,  in  §5.3.2. 


5.2.4  Examples 

Least-squares  solution  of  linear  equations 

Recall  the  problem  (5.5): 

\[
\begin{array}{ll}
\text{minimize} & x^T x \\
\text{subject to} & Ax = b.
\end{array}
\]

The associated dual problem is

\[ \text{maximize} \quad -(1/4) \nu^T A A^T \nu - b^T \nu, \]

which  is  an  unconstrained  concave  quadratic  maximization  problem. 

Slater’s condition is simply that the primal problem is feasible, so $p^\star = d^\star$
provided $b \in \mathcal{R}(A)$, i.e., $p^\star < \infty$. In fact for this problem we always have strong
duality, even when $p^\star = \infty$. This is the case when $b \notin \mathcal{R}(A)$, so there is a $z$ with
$A^T z = 0$, $b^T z \neq 0$. It follows that the dual function is unbounded above along the
line $\{ tz \mid t \in \mathbf{R} \}$, so $d^\star = \infty$ as well.
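As a small numerical check of this example, the sketch below evaluates both optimal values for illustrative data of our own choosing ($A = [1 \;\; 1]$, $b = 2$, which do not appear in the text). It uses the closed-form least-norm solution $x^\star = A^T (A A^T)^{-1} b$ and the stationary point of the concave dual quadratic; the variable names are ours.

```python
# Least-norm problem: minimize x^T x subject to A x = b.
# Illustrative data (an assumption): one equality constraint, two variables.
A = [1.0, 1.0]   # the single row of A
b = 2.0

# A A^T is a scalar here.
aat = sum(a * a for a in A)

# Primal optimum: x* = A^T (A A^T)^{-1} b, the minimum-norm solution of A x = b.
x_star = [a * b / aat for a in A]
p_star = sum(x * x for x in x_star)

def g(nu):
    """Dual function g(nu) = -(1/4) nu^T A A^T nu - b^T nu."""
    return -0.25 * aat * nu * nu - b * nu

# g is a concave quadratic; setting its derivative to zero gives nu* = -2 b / (A A^T).
nu_star = -2.0 * b / aat
d_star = g(nu_star)

print(p_star, d_star)
```

For this data both values equal 2, and any other $\nu$ gives a strictly smaller lower bound $g(\nu)$.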

Lagrange  dual  of  LP 

By  the  weaker  form  of  Slater’s  condition,  we  find  that  strong  duality  holds  for 
any  LP  (in  standard  or  inequality  form)  provided  the  primal  problem  is  feasible. 
Applying  this  result  to  the  duals,  we  conclude  that  strong  duality  holds  for  LPs 
if  the  dual  is  feasible.  This  leaves  only  one  possible  situation  in  which  strong 
duality  for  LPs  can  fail:  both  the  primal  and  dual  problems  are  infeasible.  This 
pathological  case  can,  in  fact,  occur;  see  exercise  5.23. 

Lagrange  dual  of  QCQP 

We  consider  the  QCQP 

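\[
\begin{array}{ll}
\text{minimize} & (1/2) x^T P_0 x + q_0^T x + r_0 \\
\text{subject to} & (1/2) x^T P_i x + q_i^T x + r_i \leq 0, \quad i = 1, \ldots, m,
\end{array}
\tag{5.28}
\]

with $P_0 \in \mathbf{S}^n_{++}$, and $P_i \in \mathbf{S}^n_+$, $i = 1, \ldots, m$. The Lagrangian is

\[ L(x, \lambda) = (1/2) x^T P(\lambda) x + q(\lambda)^T x + r(\lambda), \]

where

\[
P(\lambda) = P_0 + \sum_{i=1}^m \lambda_i P_i, \qquad
q(\lambda) = q_0 + \sum_{i=1}^m \lambda_i q_i, \qquad
r(\lambda) = r_0 + \sum_{i=1}^m \lambda_i r_i.
\]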
It is possible to derive an expression for $g(\lambda)$ for general $\lambda$, but it is quite complicated.
If $\lambda \succeq 0$, however, we have $P(\lambda) \succ 0$ and

\[ g(\lambda) = \inf_x L(x, \lambda) = -(1/2) q(\lambda)^T P(\lambda)^{-1} q(\lambda) + r(\lambda). \]

We  can  therefore  express  the  dual  problem  as 


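\[
\begin{array}{ll}
\text{maximize} & -(1/2) q(\lambda)^T P(\lambda)^{-1} q(\lambda) + r(\lambda) \\
\text{subject to} & \lambda \succeq 0.
\end{array}
\tag{5.29}
\]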


The  Slater  condition  says  that  strong  duality  between  (5.29)  and  (5.28)  holds  if  the 
quadratic  inequality  constraints  are  strictly  feasible,  i.e.,  there  exists  an  x  with 

\[ (1/2) x^T P_i x + q_i^T x + r_i < 0, \quad i = 1, \ldots, m. \]

Entropy  maximization 

Our  next  example  is  the  entropy  maximization  problem  (5.13): 

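\[
\begin{array}{ll}
\text{minimize} & \sum_{i=1}^n x_i \log x_i \\
\text{subject to} & Ax \preceq b \\
& \mathbf{1}^T x = 1,
\end{array}
\]

with domain $\mathcal{D} = \mathbf{R}^n_+$. The Lagrange dual function was derived on page 222; the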
dual  problem  is 


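\[
\begin{array}{ll}
\text{maximize} & -b^T \lambda - \nu - e^{-\nu - 1} \sum_{i=1}^n e^{-a_i^T \lambda} \\
\text{subject to} & \lambda \succeq 0,
\end{array}
\tag{5.30}
\]

with variables $\lambda \in \mathbf{R}^m$, $\nu \in \mathbf{R}$. The (weaker) Slater condition for (5.13) tells us
that the optimal duality gap is zero if there exists an $x \succ 0$ with $Ax \preceq b$ and
$\mathbf{1}^T x = 1$.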

We  can  simplify  the  dual  problem  (5.30)  by  maximizing  over  the  dual  variable 
v  analytically.  For  fixed  A,  the  objective  function  is  maximized  when  the  derivative 
with  respect  to  v  is  zero,  i.e., 


\[ \nu = \log \sum_{i=1}^n e^{-a_i^T \lambda} - 1. \]


Substituting  this  optimal  value  of  v  into  the  dual  problem  gives 

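\[
\begin{array}{ll}
\text{maximize} & -b^T \lambda - \log \left( \sum_{i=1}^n e^{-a_i^T \lambda} \right) \\
\text{subject to} & \lambda \succeq 0,
\end{array}
\]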

which  is  a  geometric  program  (in  convex  form)  with  nonnegativity  constraints. 
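The elimination of $\nu$ can be checked numerically. The sketch below uses made-up data $A$, $b$ and a fixed $\lambda \succeq 0$ (all assumptions for illustration), takes $a_i$ to be the $i$th column of $A$, and compares the objective of (5.30) at the stationary $\nu$ with the reduced log-sum-exp objective.

```python
import math

# Hypothetical data: A has m = 2 rows and n = 3 columns a_1, a_2, a_3.
A = [[1.0, 0.0, 2.0],
     [0.5, 1.0, -1.0]]
b = [1.0, 0.5]
lam = [0.3, 0.7]   # any fixed lambda >= 0

def col_dot(A, lam, i):
    """a_i^T lambda, where a_i is the i-th column of A."""
    return sum(A[r][i] * lam[r] for r in range(len(lam)))

# S = sum_i exp(-a_i^T lambda), and b^T lambda.
S = sum(math.exp(-col_dot(A, lam, i)) for i in range(len(A[0])))
bl = sum(bi * li for bi, li in zip(b, lam))

def g(nu):
    """Objective of (5.30) as a function of nu, for the fixed lambda above."""
    return -bl - nu - math.exp(-nu - 1.0) * S

nu_opt = math.log(S) - 1.0      # point where the derivative in nu vanishes
reduced = -bl - math.log(S)     # objective after eliminating nu

print(g(nu_opt), reduced)
```

The two printed values agree up to rounding, and perturbing `nu_opt` in either direction can only decrease `g`, since the objective is concave in $\nu$.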

Minimum  volume  covering  ellipsoid 

We  consider  the  problem  (5.14): 

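\[
\begin{array}{ll}
\text{minimize} & \log \det X^{-1} \\
\text{subject to} & a_i^T X a_i \leq 1, \quad i = 1, \ldots, m,
\end{array}
\]

with domain $\mathcal{D} = \mathbf{S}^n_{++}$. The Lagrange dual function is given by (5.15), so the dual
problem can be expressed as

\[
\begin{array}{ll}
\text{maximize} & \log \det \left( \sum_{i=1}^m \lambda_i a_i a_i^T \right) - \mathbf{1}^T \lambda + n \\
\text{subject to} & \lambda \succeq 0
\end{array}
\tag{5.31}
\]

where we take $\log \det X = -\infty$ if $X \not\succ 0$.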

The (weaker) Slater condition for the problem (5.14) is that there exists an
$X \in \mathbf{S}^n_{++}$ with $a_i^T X a_i \leq 1$, for $i = 1, \ldots, m$. This is always satisfied, so strong
duality always obtains between (5.14) and the dual problem (5.31).


A  nonconvex  quadratic  problem  with  strong  duality 

On  rare  occasions  strong  duality  obtains  for  a  nonconvex  problem.  As  an  important 
example,  we  consider  the  problem  of  minimizing  a  nonconvex  quadratic  function 
over  the  unit  ball, 

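\[
\begin{array}{ll}
\text{minimize} & x^T A x + 2 b^T x \\
\text{subject to} & x^T x \leq 1,
\end{array}
\tag{5.32}
\]

where $A \in \mathbf{S}^n$, $A \not\succeq 0$, and $b \in \mathbf{R}^n$. Since $A \not\succeq 0$, this is not a convex problem. This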
problem  is  sometimes  called  the  trust  region  problem ,  and  arises  in  minimizing  a 
second-order  approximation  of  a  function  over  the  unit  ball,  which  is  the  region  in 
which  the  approximation  is  assumed  to  be  approximately  valid. 

The  Lagrangian  is 

\[ L(x, \lambda) = x^T A x + 2 b^T x + \lambda (x^T x - 1) = x^T (A + \lambda I) x + 2 b^T x - \lambda, \]


so  the  dual  function  is  given  by 


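\[
g(\lambda) = \left\{ \begin{array}{ll}
-b^T (A + \lambda I)^\dagger b - \lambda & A + \lambda I \succeq 0, \;\; b \in \mathcal{R}(A + \lambda I) \\
-\infty & \text{otherwise,}
\end{array} \right.
\]

where $(A + \lambda I)^\dagger$ is the pseudo-inverse of $A + \lambda I$. The Lagrange dual problem is
thus

\[
\begin{array}{ll}
\text{maximize} & -b^T (A + \lambda I)^\dagger b - \lambda \\
\text{subject to} & A + \lambda I \succeq 0, \quad b \in \mathcal{R}(A + \lambda I),
\end{array}
\tag{5.33}
\]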

with variable $\lambda \in \mathbf{R}$. Although it is not obvious from this expression, this is a
convex optimization problem. In fact, it is readily solved since it can be expressed
as

\[
\begin{array}{ll}
\text{maximize} & -\sum_{i=1}^n (q_i^T b)^2 / (\lambda_i + \lambda) - \lambda \\
\text{subject to} & \lambda \geq -\lambda_{\min}(A),
\end{array}
\]

where $\lambda_i$ and $q_i$ are the eigenvalues and corresponding (orthonormal) eigenvectors
of $A$, and we interpret $(q_i^T b)^2 / 0$ as $0$ if $q_i^T b = 0$ and as $\infty$ otherwise.

Despite  the  fact  that  the  original  problem  (5.32)  is  not  convex,  we  always  have 
zero  optimal  duality  gap  for  this  problem:  The  optimal  values  of  (5.32)  and  (5.33) 
are  always  the  same.  In  fact,  a  more  general  result  holds:  strong  duality  holds  for 
any  optimization  problem  with  quadratic  objective  and  one  quadratic  inequality 
constraint, provided Slater’s condition holds; see §B.1.
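To see the zero duality gap on a concrete instance, the following sketch compares a brute-force estimate of the primal optimal value with the maximum of the eigenvalue form of the dual. The data ($A = \mathbf{diag}(-1, 2)$, $b = (1/2, 1/2)$), the grid sizes, and the helper names are assumptions made for illustration; the grid search and the ternary search are simple stand-ins, not methods proposed in the text.

```python
import math

# Illustrative data (an assumption, not from the text): A = diag(-1, 2) is
# indefinite, so the primal problem is nonconvex. For diagonal A the
# eigenvalues are the diagonal entries and the eigenvectors are unit vectors.
eigs = [-1.0, 2.0]
b = [0.5, 0.5]

def f0(x):
    """Primal objective x^T A x + 2 b^T x for the diagonal A above."""
    return sum(l * xi * xi + 2.0 * bi * xi for l, xi, bi in zip(eigs, x, b))

def primal_brute_force(n_radii=20, n_angles=10000):
    """Approximate p* by a polar grid search over the unit disc."""
    best = float("inf")
    for r in range(n_radii + 1):
        rho = r / n_radii
        for k in range(n_angles):
            theta = 2.0 * math.pi * k / n_angles
            best = min(best, f0((rho * math.cos(theta), rho * math.sin(theta))))
    return best

def dual_objective(lam):
    """-sum_i (q_i^T b)^2 / (lambda_i + lam) - lam, for lam > -lambda_min(A)."""
    return -sum(bi * bi / (l + lam) for l, bi in zip(eigs, b)) - lam

def dual_ternary_search(lo, hi, iters=200):
    """Maximize the concave dual objective on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual_objective(m1) < dual_objective(m2):
            lo = m1
        else:
            hi = m2
    return dual_objective(0.5 * (lo + hi))

p_star = primal_brute_force()
d_star = dual_ternary_search(-min(eigs) + 1e-9, 50.0)
print(p_star, d_star)   # the two values should agree to grid accuracy
```

The upper end 50 of the search interval is an arbitrary choice that happens to contain the maximizer for this data; a different instance may need a larger bracket.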
5.2.5  Mixed  strategies  for  matrix  games


In  this  section  we  use  strong  duality  to  derive  a  basic  result  for  zero-sum  matrix 
games.  We  consider  a  game  with  two  players.  Player  1  makes  a  choice  (or  move) 
$k \in \{1, \ldots, n\}$, and player 2 makes a choice $l \in \{1, \ldots, m\}$. Player 1 then makes a
payment of $P_{kl}$ to player 2, where $P \in \mathbf{R}^{n \times m}$ is the payoff matrix for the game.
The  goal  of  player  1  is  to  make  the  payment  as  small  as  possible,  while  the  goal  of 
player  2  is  to  maximize  it. 

The  players  use  randomized  or  mixed  strategies ,  which  means  that  each  player 
makes  his  or  her  choice  randomly  and  independently  of  the  other  player’s  choice, 
according  to  a  probability  distribution: 

\[ \operatorname{prob}(k = i) = u_i, \quad i = 1, \ldots, n, \qquad \operatorname{prob}(l = i) = v_i, \quad i = 1, \ldots, m. \]


Here  u  and  v  give  the  probability  distributions  of  the  choices  of  the  two  players, 
i.e.,  their  associated  strategies.  The  expected  payoff  from  player  1  to  player  2  is 
then 

\[ \sum_{k=1}^n \sum_{l=1}^m u_k v_l P_{kl} = u^T P v. \]

Player 1 wishes to choose $u$ to minimize $u^T P v$, while player 2 wishes to choose $v$
to maximize $u^T P v$.

Let  us  first  analyze  the  game  from  the  point  of  view  of  player  1,  assuming  her 
strategy  u  is  known  to  player  2  (which  clearly  gives  an  advantage  to  player  2). 
Player 2 will choose $v$ to maximize $u^T P v$, which results in the expected payoff

\[ \sup\{ u^T P v \mid v \succeq 0, \; \mathbf{1}^T v = 1 \} = \max_{i=1,\ldots,m} (P^T u)_i. \]


The  best  thing  player  1  can  do  is  to  choose  u  to  minimize  this  worst-case  payoff  to 
player  2,  i.e.,  to  choose  a  strategy  u  that  solves  the  problem 


\[
\begin{array}{ll}
\text{minimize} & \max_{i=1,\ldots,m} (P^T u)_i \\
\text{subject to} & u \succeq 0, \quad \mathbf{1}^T u = 1,
\end{array}
\tag{5.34}
\]

which is a piecewise-linear convex optimization problem. We will denote the optimal
value of this problem as $p_1^\star$. This is the smallest expected payoff player 1 can
arrange to have, assuming that player 2 knows the strategy of player 1, and plays
to his own maximum advantage.

In  a  similar  way  we  can  consider  the  situation  in  which  v,  the  strategy  of 
player  2,  is  known  to  player  1  (which  gives  an  advantage  to  player  1).  In  this  case 
player 1 chooses $u$ to minimize $u^T P v$, which results in an expected payoff of

\[ \inf\{ u^T P v \mid u \succeq 0, \; \mathbf{1}^T u = 1 \} = \min_{i=1,\ldots,n} (P v)_i. \]

Player  2  chooses  v  to  maximize  this,  i.e.,  chooses  a  strategy  v  that  solves  the 
problem 


\[
\begin{array}{ll}
\text{maximize} & \min_{i=1,\ldots,n} (P v)_i \\
\text{subject to} & v \succeq 0, \quad \mathbf{1}^T v = 1,
\end{array}
\tag{5.35}
\]

which is another convex optimization problem, with piecewise-linear (concave) objective.
We will denote the optimal value of this problem as $p_2^\star$. This is the largest
expected payoff player 2 can guarantee getting, assuming that player 1 knows the
strategy of player 2.

It is intuitively obvious that knowing your opponent’s strategy gives an advantage
(or at least, cannot hurt), and indeed, it is easily shown that we always have
$p_1^\star \geq p_2^\star$. We can interpret the difference, $p_1^\star - p_2^\star$, which is nonnegative, as the
advantage conferred on a player by knowing the opponent’s strategy.

Using duality, we can establish a result that is at first surprising: $p_1^\star = p_2^\star$.
In  other  words,  in  a  matrix  game  with  mixed  strategies,  there  is  no  advantage  to 
knowing  your  opponent’s  strategy.  We  will  establish  this  result  by  showing  that 
the  two  problems  (5.34)  and  (5.35)  are  Lagrange  dual  problems,  for  which  strong 
duality  obtains. 

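We start by formulating (5.34) as an LP,

\[
\begin{array}{ll}
\text{minimize} & t \\
\text{subject to} & u \succeq 0, \quad \mathbf{1}^T u = 1 \\
& P^T u \preceq t \mathbf{1},
\end{array}
\]

with extra variable $t \in \mathbf{R}$. Introducing the multiplier $\lambda$ for $P^T u \preceq t \mathbf{1}$, $\mu$ for $u \succeq 0$,
and $\nu$ for $\mathbf{1}^T u = 1$, the Lagrangian is

\[
t + \lambda^T (P^T u - t \mathbf{1}) - \mu^T u + \nu (1 - \mathbf{1}^T u)
= \nu + (1 - \mathbf{1}^T \lambda) t + (P \lambda - \nu \mathbf{1} - \mu)^T u,
\]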


so  the  dual  function  is 


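\[
g(\lambda, \mu, \nu) = \left\{ \begin{array}{ll}
\nu & \mathbf{1}^T \lambda = 1, \;\; P \lambda - \nu \mathbf{1} = \mu \\
-\infty & \text{otherwise.}
\end{array} \right.
\]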


The  dual  problem  is  then 


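\[
\begin{array}{ll}
\text{maximize} & \nu \\
\text{subject to} & \lambda \succeq 0, \quad \mathbf{1}^T \lambda = 1, \quad \mu \succeq 0 \\
& P \lambda - \nu \mathbf{1} = \mu.
\end{array}
\]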


Eliminating  p  we  obtain  the  following  Lagrange  dual  of  (5.34): 


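\[
\begin{array}{ll}
\text{maximize} & \nu \\
\text{subject to} & \lambda \succeq 0, \quad \mathbf{1}^T \lambda = 1 \\
& P \lambda \succeq \nu \mathbf{1},
\end{array}
\]

with variables $\lambda$, $\nu$. But this is clearly equivalent to (5.35). Since the LPs are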
feasible,  we  have  strong  duality;  the  optimal  values  of  (5.34)  and  (5.35)  are  equal. 
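A small instance makes the result concrete. The sketch below uses the matching-pennies payoff matrix (our choice of example, not one from the text) and evaluates the objectives of (5.34) and (5.35) at the uniform strategies. Any feasible $u$ gives an upper bound on $p_1^\star$, any feasible $v$ gives a lower bound on $p_2^\star$, and $p_1^\star \geq p_2^\star$ always holds, so matching bounds certify the common value without solving either LP. The function names are hypothetical.

```python
# Matching pennies as an illustrative payoff matrix:
# player 1 pays P[k][l] to player 2 when the choices are k and l.
P = [[1.0, -1.0],
     [-1.0, 1.0]]

def worst_case_for_player1(P, u):
    """max_i (P^T u)_i: expected payment if player 2 knows u and replies optimally."""
    n, m = len(P), len(P[0])
    return max(sum(P[k][i] * u[k] for k in range(n)) for i in range(m))

def worst_case_for_player2(P, v):
    """min_i (P v)_i: expected payment if player 1 knows v and replies optimally."""
    n, m = len(P), len(P[0])
    return min(sum(P[i][l] * v[l] for l in range(m)) for i in range(n))

u = [0.5, 0.5]   # mixed strategy for player 1
v = [0.5, 0.5]   # mixed strategy for player 2

upper = worst_case_for_player1(P, u)   # upper bound on p1*
lower = worst_case_for_player2(P, v)   # lower bound on p2*
print(lower, upper)

# A pure strategy is much worse for player 1 once it is known to player 2.
pure = worst_case_for_player1(P, [1.0, 0.0])
```

Here `lower` and `upper` are both zero, so $p_1^\star = p_2^\star = 0$ for this game, while the pure strategy costs player 1 a payment of 1.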
5.3  Geometric  interpretation

5.3.1  Weak  and  strong  duality  via  set  of  values 

We  can  give  a  simple  geometric  interpretation  of  the  dual  function  in  terms  of  the 
set 

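\[
\mathcal{G} = \{ (f_1(x), \ldots, f_m(x), h_1(x), \ldots, h_p(x), f_0(x)) \in \mathbf{R}^m \times \mathbf{R}^p \times \mathbf{R} \mid x \in \mathcal{D} \}, \tag{5.36}
\]

which is the set of values taken on by the constraint and objective functions. The
optimal value $p^\star$ of (5.1) is easily expressed in terms of $\mathcal{G}$ as

\[ p^\star = \inf\{ t \mid (u, v, t) \in \mathcal{G}, \; u \preceq 0, \; v = 0 \}. \]

To evaluate the dual function at $(\lambda, \nu)$, we minimize the affine function

\[ (\lambda, \nu, 1)^T (u, v, t) = \sum_{i=1}^m \lambda_i u_i + \sum_{i=1}^p \nu_i v_i + t \]

over $(u, v, t) \in \mathcal{G}$, i.e., we have

\[ g(\lambda, \nu) = \inf\{ (\lambda, \nu, 1)^T (u, v, t) \mid (u, v, t) \in \mathcal{G} \}. \]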

In  particular,  we  see  that  if  the  infimum  is  finite,  then  the  inequality 

\[ (\lambda, \nu, 1)^T (u, v, t) \geq g(\lambda, \nu) \]

defines a supporting hyperplane to $\mathcal{G}$. This is sometimes referred to as a nonvertical
supporting hyperplane, because the last component of the normal vector is nonzero.

Now suppose $\lambda \succeq 0$. Then, obviously, $t \geq (\lambda, \nu, 1)^T (u, v, t)$ if $u \preceq 0$ and $v = 0$.
Therefore 


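\[
\begin{array}{rcl}
p^\star & = & \inf\{ t \mid (u, v, t) \in \mathcal{G}, \; u \preceq 0, \; v = 0 \} \\
& \geq & \inf\{ (\lambda, \nu, 1)^T (u, v, t) \mid (u, v, t) \in \mathcal{G}, \; u \preceq 0, \; v = 0 \} \\
& \geq & \inf\{ (\lambda, \nu, 1)^T (u, v, t) \mid (u, v, t) \in \mathcal{G} \} \\
& = & g(\lambda, \nu),
\end{array}
\]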

i.e.,  we  have  weak  duality.  This  interpretation  is  illustrated  in  figures  5.3  and  5.4, 
for  a  simple  problem  with  one  inequality  constraint. 
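The construction can be imitated numerically for a problem with one inequality constraint. The sketch below samples the set of achievable pairs $(f_1(x), f_0(x))$ for the toy problem of minimizing $x^2$ subject to $1 - x \leq 0$ (an example of our own, with $p^\star = 1$), then evaluates $p^\star$ and $g(\lambda)$ directly from the sampled set. The grid over $[-3, 3]$ is an assumption that happens to contain all the minimizers involved.

```python
# Sample the set G = {(f_1(x), f_0(x))} for f_0(x) = x^2 and f_1(x) = 1 - x.
xs = [i / 1000.0 for i in range(-3000, 3001)]
G = [(1.0 - x, x * x) for x in xs]

# p* = inf{ t | (u, t) in G, u <= 0 }
p_star = min(t for (u, t) in G if u <= 0.0)

def g(lam):
    """g(lam) = inf{ lam*u + t | (u, t) in G }, evaluated on the sample."""
    return min(lam * u + t for (u, t) in G)

# Each lam >= 0 gives a supporting line of slope -lam whose intercept with
# the u = 0 axis is the lower bound g(lam) on p*.
bounds = {lam: g(lam) for lam in (0.0, 1.0, 2.0, 3.0)}
print(p_star, bounds)
```

Every value in `bounds` is at most `p_star`, and the bound is tight at $\lambda = 2$, as expected for a convex problem with a strictly feasible point.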

Epigraph  variation 

In  this  section  we  describe  a  variation  on  the  geometric  interpretation  of  duality  in 
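terms of $\mathcal{G}$, which explains why strong duality obtains for (most) convex problems.
We define the set $\mathcal{A} \subseteq \mathbf{R}^m \times \mathbf{R}^p \times \mathbf{R}$ as

\[ \mathcal{A} = \mathcal{G} + \left( \mathbf{R}^m_+ \times \{0\} \times \mathbf{R}_+ \right), \]

or, more explicitly,

\[
\begin{array}{rl}
\mathcal{A} = \{ (u, v, t) \mid & \exists x \in \mathcal{D}, \;\; f_i(x) \leq u_i, \;\; i = 1, \ldots, m, \\
& h_i(x) = v_i, \;\; i = 1, \ldots, p, \;\; f_0(x) \leq t \}.
\end{array}
\tag{5.37}
\]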
Figure 5.3 Geometric interpretation of dual function and lower bound $g(\lambda) \leq p^\star$,
for a problem with one (inequality) constraint. Given $\lambda$, we minimize
$(\lambda, 1)^T (u, t)$ over $\mathcal{G} = \{ (f_1(x), f_0(x)) \mid x \in \mathcal{D} \}$. This yields a supporting
hyperplane with slope $-\lambda$. The intersection of this hyperplane with the
$u = 0$ axis gives $g(\lambda)$.


Figure 5.4 Supporting hyperplanes corresponding to three dual feasible values
of $\lambda$, including the optimum $\lambda^\star$. Strong duality does not hold; the
optimal duality gap $p^\star - d^\star$ is positive.

Figure 5.5 Geometric interpretation of dual function and lower bound $g(\lambda) \leq p^\star$,
for a problem with one (inequality) constraint. Given $\lambda$, we minimize
$(\lambda, 1)^T (u, t)$ over $\mathcal{A} = \{ (u, t) \mid \exists x \in \mathcal{D}, \; f_0(x) \leq t, \; f_1(x) \leq u \}$. This yields
a supporting hyperplane with slope $-\lambda$. The intersection of this hyperplane
with the $u = 0$ axis gives $g(\lambda)$.


We can think of $\mathcal{A}$ as a sort of epigraph form of $\mathcal{G}$, since $\mathcal{A}$ includes all the points in
$\mathcal{G}$, as well as points that are ‘worse’, i.e., those with larger objective or inequality
constraint function values.

We can express the optimal value in terms of $\mathcal{A}$ as

\[ p^\star = \inf\{ t \mid (0, 0, t) \in \mathcal{A} \}. \]

To evaluate the dual function at a point $(\lambda, \nu)$ with $\lambda \succeq 0$, we can minimize the
affine function $(\lambda, \nu, 1)^T (u, v, t)$ over $\mathcal{A}$: If $\lambda \succeq 0$, then

\[ g(\lambda, \nu) = \inf\{ (\lambda, \nu, 1)^T (u, v, t) \mid (u, v, t) \in \mathcal{A} \}. \]

If the infimum is finite, then

\[ (\lambda, \nu, 1)^T (u, v, t) \geq g(\lambda, \nu) \]

defines a nonvertical supporting hyperplane to $\mathcal{A}$.

In particular, since $(0, 0, p^\star) \in \operatorname{bd} \mathcal{A}$, we have

\[ p^\star = (\lambda, \nu, 1)^T (0, 0, p^\star) \geq g(\lambda, \nu), \tag{5.38} \]

the weak duality lower bound. Strong duality holds if and only if we have equality
in (5.38) for some dual feasible $(\lambda, \nu)$, i.e., there exists a nonvertical supporting
hyperplane to $\mathcal{A}$ at its boundary point $(0, 0, p^\star)$.

This  second  interpretation  is  illustrated  in  figure  5.5. 


5.3.2  Proof  of  strong  duality  under  constraint  qualification 

In  this  section  we  prove  that  Slater’s  constraint  qualification  guarantees  strong 
duality  (and  that  the  dual  optimum  is  attained)  for  a  convex  problem.  We  consider 
the primal problem (5.25), with $f_0, \ldots, f_m$ convex, and assume Slater’s condition
holds: There exists $\tilde x \in \operatorname{relint} \mathcal{D}$ with $f_i(\tilde x) < 0$, $i = 1, \ldots, m$, and $A \tilde x = b$. In
order to simplify the proof, we make two additional assumptions: first that $\mathcal{D}$ has
nonempty interior (hence, $\operatorname{relint} \mathcal{D} = \operatorname{int} \mathcal{D}$) and second, that $\operatorname{rank} A = p$. We
assume that $p^\star$ is finite. (Since there is a feasible point, we can only have $p^\star = -\infty$
or $p^\star$ finite; if $p^\star = -\infty$, then $d^\star = -\infty$ by weak duality.)

The set $\mathcal{A}$ defined in (5.37) is readily shown to be convex if the underlying
problem is convex. We define a second convex set $\mathcal{B}$ as

\[ \mathcal{B} = \{ (0, 0, s) \in \mathbf{R}^m \times \mathbf{R}^p \times \mathbf{R} \mid s < p^\star \}. \]

The sets $\mathcal{A}$ and $\mathcal{B}$ do not intersect. To see this, suppose $(u, v, t) \in \mathcal{A} \cap \mathcal{B}$. Since
$(u, v, t) \in \mathcal{B}$ we have $u = 0$, $v = 0$, and $t < p^\star$. Since $(u, v, t) \in \mathcal{A}$, there exists an $x$
with $f_i(x) \leq 0$, $i = 1, \ldots, m$, $Ax - b = 0$, and $f_0(x) \leq t < p^\star$, which is impossible
since $p^\star$ is the optimal value of the primal problem.

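By the separating hyperplane theorem of §2.5.1 there exists $(\tilde\lambda, \tilde\nu, \mu) \neq 0$ and $\alpha$
such that

\[ (u, v, t) \in \mathcal{A} \implies \tilde\lambda^T u + \tilde\nu^T v + \mu t \geq \alpha, \tag{5.39} \]

and

\[ (u, v, t) \in \mathcal{B} \implies \tilde\lambda^T u + \tilde\nu^T v + \mu t \leq \alpha. \tag{5.40} \]

From (5.39) we conclude that $\tilde\lambda \succeq 0$ and $\mu \geq 0$. (Otherwise $\tilde\lambda^T u + \mu t$ is unbounded
below over $\mathcal{A}$, contradicting (5.39).) The condition (5.40) simply means that $\mu t \leq \alpha$
for all $t < p^\star$, and hence, $\mu p^\star \leq \alpha$. Together with (5.39) we conclude that for any
$x \in \mathcal{D}$,

\[ \sum_{i=1}^m \tilde\lambda_i f_i(x) + \tilde\nu^T (Ax - b) + \mu f_0(x) \geq \alpha \geq \mu p^\star. \tag{5.41} \]

Assume that $\mu > 0$. In that case we can divide (5.41) by $\mu$ to obtain

\[ L(x, \tilde\lambda / \mu, \tilde\nu / \mu) \geq p^\star \]

for all $x \in \mathcal{D}$, from which it follows, by minimizing over $x$, that $g(\lambda, \nu) \geq p^\star$, where
we define

\[ \lambda = \tilde\lambda / \mu, \qquad \nu = \tilde\nu / \mu. \]

By weak duality we have $g(\lambda, \nu) \leq p^\star$, so in fact $g(\lambda, \nu) = p^\star$. This shows that
strong duality holds, and that the dual optimum is attained, at least in the case
when $\mu > 0$.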

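Now consider the case $\mu = 0$. From (5.41), we conclude that for all $x \in \mathcal{D}$,

\[ \sum_{i=1}^m \tilde\lambda_i f_i(x) + \tilde\nu^T (Ax - b) \geq 0. \tag{5.42} \]

Applying this to the point $\tilde x$ that satisfies the Slater condition, we have

\[ \sum_{i=1}^m \tilde\lambda_i f_i(\tilde x) \geq 0. \]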
Figure 5.6 Illustration of strong duality proof, for a convex problem that satisfies
Slater’s constraint qualification. The set $\mathcal{A}$ is shown shaded, and the
set $\mathcal{B}$ is the thick vertical line segment, not including the point $(0, p^\star)$, shown
as a small open circle. The two sets are convex and do not intersect, so they
can be separated by a hyperplane. Slater’s constraint qualification guarantees
that any separating hyperplane must be nonvertical, since it must pass
to the left of the point $(u, t) = (f_1(\tilde x), f_0(\tilde x))$, where $\tilde x$ is strictly feasible.


Since $f_i(\tilde x) < 0$ and $\tilde\lambda_i \geq 0$, we conclude that $\tilde\lambda = 0$. From $(\tilde\lambda, \tilde\nu, \mu) \neq 0$ and
$\tilde\lambda = 0$, $\mu = 0$, we conclude that $\tilde\nu \neq 0$. Then (5.42) implies that for all $x \in \mathcal{D}$,
$\tilde\nu^T (Ax - b) \geq 0$. But $\tilde x$ satisfies $\tilde\nu^T (A \tilde x - b) = 0$, and since $\tilde x \in \operatorname{int} \mathcal{D}$, there are
points in $\mathcal{D}$ with $\tilde\nu^T (Ax - b) < 0$ unless $A^T \tilde\nu = 0$. This, of course, contradicts our
assumption that $\operatorname{rank} A = p$.

The  geometric  idea  behind  the  proof  is  illustrated  in  figure  5.6,  for  a  simple 
problem with one inequality constraint. The hyperplane separating $\mathcal{A}$ and $\mathcal{B}$ defines
a supporting hyperplane to $\mathcal{A}$ at $(0, p^\star)$. Slater’s constraint qualification is used
to establish that the hyperplane must be nonvertical (i.e., has a normal vector of
the form $(\lambda^\star, 1)$). (For a simple example of a convex problem with one inequality
constraint  for  which  strong  duality  fails,  see  exercise  5.21.) 


5.3.3  Multicriterion  interpretation 


There  is  a  natural  connection  between  Lagrange  duality  for  a  problem  without 
equality  constraints, 


\[
\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m,
\end{array}
\tag{5.43}
\]

and the scalarization method for the (unconstrained) multicriterion problem

\[ \text{minimize (w.r.t. } \mathbf{R}^{m+1}_+ \text{)} \quad F(x) = (f_1(x), \ldots, f_m(x), f_0(x)) \tag{5.44} \]

(see §4.7.4). In scalarization, we choose a positive vector $\tilde\lambda$, and minimize the scalar
function $\tilde\lambda^T F(x)$; any minimizer is guaranteed to be Pareto optimal. Since we can
scale $\tilde\lambda$ by a positive constant, without affecting the minimizers, we can, without
loss of generality, take $\tilde\lambda = (\lambda, 1)$. Thus, in scalarization we minimize the function

\[ \tilde\lambda^T F(x) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x), \]

which is exactly the Lagrangian for the problem (5.43).

To establish that every Pareto optimal point of a convex multicriterion problem
minimizes the function $\tilde\lambda^T F(x)$ for some nonnegative weight vector $\tilde\lambda$, we considered
the set $\mathcal{A}$, defined in (4.62),

\[ \mathcal{A} = \{ t \in \mathbf{R}^{m+1} \mid \exists x \in \mathcal{D}, \; f_i(x) \leq t_i, \; i = 0, \ldots, m \}, \]

which is exactly the same as the set $\mathcal{A}$ defined in (5.37), that arises in Lagrange duality.
Here too we constructed the required weight vector as a supporting hyperplane
to the set, at an arbitrary Pareto optimal point. In multicriterion optimization,
we interpret the components of the weight vector as giving the relative weights
between the objective functions. When we fix the last component of the weight
vector (associated with $f_0$) to be one, the other weights have the interpretation of
the cost relative to $f_0$, i.e., the cost relative to the objective.


5.4  Saddle-point  interpretation 

In  this  section  we  give  several  interpretations  of  Lagrange  duality.  The  material  of 
this  section  will  not  be  used  in  the  sequel. 


5.4.1  Max-min  characterization  of  weak  and  strong  duality 

It  is  possible  to  express  the  primal  and  the  dual  optimization  problems  in  a  form 
that  is  more  symmetric.  To  simplify  the  discussion  we  assume  there  are  no  equality 
constraints;  the  results  are  easily  extended  to  cover  them. 

First  note  that 


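\[
\sup_{\lambda \succeq 0} L(x, \lambda)
= \sup_{\lambda \succeq 0} \left( f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) \right)
= \left\{ \begin{array}{ll}
f_0(x) & f_i(x) \leq 0, \;\; i = 1, \ldots, m \\
\infty & \text{otherwise.}
\end{array} \right.
\]

Indeed, suppose $x$ is not feasible, and $f_i(x) > 0$ for some $i$. Then $\sup_{\lambda \succeq 0} L(x, \lambda) = \infty$,
as can be seen by choosing $\lambda_j = 0$, $j \neq i$, and $\lambda_i \to \infty$. On the other
hand, if $f_i(x) \leq 0$, $i = 1, \ldots, m$, then the optimal choice of $\lambda$ is $\lambda = 0$ and
$\sup_{\lambda \succeq 0} L(x, \lambda) = f_0(x)$. This means that we can express the optimal value of the
primal problem as

\[ p^\star = \inf_x \sup_{\lambda \succeq 0} L(x, \lambda). \]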

By  the  definition  of  the  dual  function,  we  also  have 

\[ d^\star = \sup_{\lambda \succeq 0} \inf_x L(x, \lambda). \]

Thus, weak duality can be expressed as the inequality

\[ \sup_{\lambda \succeq 0} \inf_x L(x, \lambda) \leq \inf_x \sup_{\lambda \succeq 0} L(x, \lambda), \tag{5.45} \]

and strong duality as the equality

\[ \sup_{\lambda \succeq 0} \inf_x L(x, \lambda) = \inf_x \sup_{\lambda \succeq 0} L(x, \lambda). \]

Strong duality means that the order of the minimization over $x$ and the maximization
over $\lambda \succeq 0$ can be switched without affecting the result.

In  fact,  the  inequality  (5.45)  does  not  depend  on  any  properties  of  L:  We  have 

\[ \sup_{z \in Z} \inf_{w \in W} f(w, z) \leq \inf_{w \in W} \sup_{z \in Z} f(w, z) \tag{5.46} \]

for any $f : \mathbf{R}^n \times \mathbf{R}^m \to \mathbf{R}$ (and any $W \subseteq \mathbf{R}^n$ and $Z \subseteq \mathbf{R}^m$). This general inequality
is called the max-min inequality. When equality holds, i.e.,

\[ \sup_{z \in Z} \inf_{w \in W} f(w, z) = \inf_{w \in W} \sup_{z \in Z} f(w, z) \tag{5.47} \]

we say that $f$ (and $W$ and $Z$) satisfy the strong max-min property or the saddle-point
property. Of course the strong max-min property holds only in special cases,
for example, when $f : \mathbf{R}^n \times \mathbf{R}^m \to \mathbf{R}$ is the Lagrangian of a problem for which
strong duality obtains, $W = \mathbf{R}^n$, and $Z = \mathbf{R}^m_+$.
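The max-min inequality is easy to check when $W$ and $Z$ are finite, since $f$ is then just a table of values. The sketch below is our own illustration: rows index $w$, columns index $z$, and the two small tables are made-up examples, one with a strict gap and one with equality.

```python
def max_min(F):
    """sup over z (columns) of inf over w (rows) of F[w][z]."""
    return max(min(F[w][z] for w in range(len(F))) for z in range(len(F[0])))

def min_max(F):
    """inf over w (rows) of sup over z (columns) of F[w][z]."""
    return min(max(row) for row in F)

# Strict inequality: max-min is 2, min-max is 3.
F_gap = [[1, 4],
         [3, 2]]

# Equality: both sides equal 2, attained at the entry in row 0, column 1.
F_saddle = [[1, 2],
            [3, 4]]

for F in (F_gap, F_saddle):
    print(max_min(F), min_max(F))
```

In the second table the entry in row 0, column 1 is the largest in its row and the smallest in its column, which is exactly the discrete analog of the strong max-min property (5.47).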


5.4.2  Saddle-point  interpretation 


We refer to a pair $\tilde w \in W$, $\tilde z \in Z$ as a saddle-point for $f$ (and $W$ and $Z$) if

\[ f(\tilde w, z) \leq f(\tilde w, \tilde z) \leq f(w, \tilde z) \]

for all $w \in W$ and $z \in Z$. In other words, $\tilde w$ minimizes $f(w, \tilde z)$ (over $w \in W$) and
$\tilde z$ maximizes $f(\tilde w, z)$ (over $z \in Z$):

\[ f(\tilde w, \tilde z) = \inf_{w \in W} f(w, \tilde z), \qquad f(\tilde w, \tilde z) = \sup_{z \in Z} f(\tilde w, z). \]

This implies that the strong max-min property (5.47) holds, and that the common
value is $f(\tilde w, \tilde z)$.

Returning to our discussion of Lagrange duality, we see that if $x^\star$ and $\lambda^\star$ are
primal and dual optimal points for a problem in which strong duality obtains, they
form a saddle-point for the Lagrangian. The converse is also true: If $(x, \lambda)$ is a
saddle-point of the Lagrangian, then $x$ is primal optimal, $\lambda$ is dual optimal, and
the optimal duality gap is zero.


5.4.3  Game  interpretation 

We  can  interpret  the  max-min  inequality  (5.46),  the  max-min  equality  (5.47),  and 
the  saddle-point  property,  in  terms  of  a  continuous  zero-sum  game.  If  the  first 
player chooses $w \in W$, and the second player selects $z \in Z$, then player 1 pays an
amount $f(w, z)$ to player 2. Player 1 therefore wants to minimize $f$, while player 2
wants to maximize $f$. (The game is called continuous since the choices are vectors,
and not discrete.)

Suppose that player 1 makes his choice first, and then player 2, after learning
the choice of player 1, makes her selection. Player 2 wants to maximize the payoff
$f(w, z)$, and so will choose $z \in Z$ to maximize $f(w, z)$. The resulting payoff will
be $\sup_{z \in Z} f(w, z)$, which depends on $w$, the choice of the first player. (We assume
here that the supremum is achieved; if not the optimal payoff can be arbitrarily
close to $\sup_{z \in Z} f(w, z)$.) Player 1 knows (or assumes) that player 2 will follow this
strategy, and so will choose $w \in W$ to make this worst-case payoff to player 2 as
small as possible. Thus player 1 chooses

\[ \mathop{\rm argmin}_{w \in W} \sup_{z \in Z} f(w, z), \]

which results in the payoff

\[ \inf_{w \in W} \sup_{z \in Z} f(w, z) \]

from player 1 to player 2.

Now suppose the order of play is reversed: Player 2 must choose $z \in Z$ first, and
then player 1 chooses $w \in W$ (with knowledge of $z$). Following a similar argument,
if the players follow the optimal strategy, player 2 should choose $z \in Z$ to maximize
$\inf_{w \in W} f(w, z)$, which results in the payoff of

\[ \sup_{z \in Z} \inf_{w \in W} f(w, z) \]


from  player  1  to  player  2. 

The  max-min  inequality  (5.46)  states  the  (intuitively  obvious)  fact  that  it  is 
better  for  a  player  to  go  second,  or  more  precisely,  for  a  player  to  know  his  or  her 
opponent’s  choice  before  choosing.  In  other  words,  the  payoff  to  player  2  will  be 
larger  if  player  1  must  choose  first.  When  the  saddle-point  property  (5.47)  holds, 
there  is  no  advantage  to  playing  second. 

If $(\tilde w, \tilde z)$ is a saddle-point for $f$ (and $W$ and $Z$), then it is called a solution of
the game; $\tilde w$ is called the optimal choice or strategy for player 1, and $\tilde z$ is called
the optimal choice or strategy for player 2. In this case there is no advantage to
playing second.

Now consider the special case where the payoff function is the Lagrangian,
$W = \mathbf{R}^n$ and $Z = \mathbf{R}^m_+$. Here player 1 chooses the primal variable $x$, while player 2
chooses the dual variable $\lambda \succeq 0$. By the argument above, the optimal choice for
player 2, if she must choose first, is any $\lambda^\star$ which is dual optimal, which results
in a payoff to player 2 of $d^\star$. Conversely, if player 1 must choose first, his optimal
choice is any primal optimal $x^\star$, which results in a payoff of $p^\star$.

The  optimal  duality  gap  for  the  problem  is  exactly  equal  to  the  advantage 
afforded  the  player  who  goes  second,  i.e.,  the  player  who  has  the  advantage  of 
knowing  his  or  her  opponent’s  choice  before  choosing.  If  strong  duality  holds,  then 
there  is  no  advantage  to  the  players  of  knowing  their  opponent’s  choice. 


5.4.4  Price  or  tax  interpretation 

Lagrange  duality  has  an  interesting  economic  interpretation.  Suppose  the  variable 
$x$ denotes how an enterprise operates and $f_0(x)$ denotes the cost of operating at
$x$, i.e., $-f_0(x)$ is the profit (say, in dollars) made at the operating condition $x$.
Each constraint $f_i(x) \leq 0$ represents some limit, such as a limit on resources (e.g.,
warehouse space, labor) or a regulatory limit (e.g., environmental). The operating
condition that maximizes profit while respecting the limits can be found by solving
the problem

\[
\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m.
\end{array}
\]

The resulting optimal profit is $-p^\star$.

Now imagine a second scenario in which the limits can be violated, by paying an
additional cost which is linear in the amount of violation, measured by $f_i$. Thus the
payment made by the enterprise for the $i$th limit or constraint is $\lambda_i f_i(x)$. Payments
are also made to the firm for constraints that are not tight; if $f_i(x) < 0$, then $\lambda_i f_i(x)$
represents a payment to the firm. The coefficient $\lambda_i$ has the interpretation of the
price for violating $f_i(x) \leq 0$; its units are dollars per unit violation (as measured
by $f_i$). For the same price the enterprise can sell any ‘unused’ portion of the $i$th
constraint. We assume $\lambda_i \geq 0$, i.e., the firm must pay for violations (and receives
income if a constraint is not tight).

As an example, suppose the first constraint in the original problem, $f_1(x) \leq 0$,
represents a limit on warehouse space (say, in square meters). In this new
arrangement, we open the possibility that the firm can rent extra warehouse space
at a cost of $\lambda_1$ dollars per square meter and also rent out unused space, at the same
rate.

The total cost to the firm, for operating condition $x$, and constraint prices
$\lambda_i$, is $L(x, \lambda) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x)$. The firm will obviously operate so as to
minimize its total cost $L(x, \lambda)$, which yields a cost $g(\lambda)$. The dual function therefore
represents the optimal cost to the firm, as a function of the constraint price vector
$\lambda$. The optimal dual value, $d^\star$, is the optimal cost to the enterprise under the least
favorable set of prices.
Using this interpretation we can paraphrase weak duality as follows: The optimal
cost to the firm in the second scenario (in which constraint violations can be
bought and sold) is less than or equal to the cost in the original situation (which
has constraints that cannot be violated), even with the most unfavorable prices.
This is obvious: If $x^\star$ is optimal in the first scenario, then the operating cost of $x^\star$
in the second scenario will be lower than $f_0(x^\star)$, since some income can be derived
from the constraints that are not tight. The optimal duality gap is then the minimum
possible advantage to the enterprise of being allowed to pay for constraint
violations (and receive payments for nontight constraints).

Now suppose strong duality holds, and the dual optimum is attained. We can
interpret a dual optimal $\lambda^\star$ as a set of prices for which there is no advantage to
the firm in being allowed to pay for constraint violations (or receive payments for
nontight constraints). For this reason a dual optimal $\lambda^\star$ is sometimes called a set
of shadow prices for the original problem.


5.5  Optimality  conditions 

We  remind  the  reader  that  we  do  not  assume  the  problem  (5.1)  is  convex,  unless 
explicitly  stated. 


5.5.1  Certificate  of  suboptimality  and  stopping  criteria 

If we can find a dual feasible $(\lambda, \nu)$, we establish a lower bound on the optimal value
of the primal problem: $p^\star \ge g(\lambda, \nu)$. Thus a dual feasible point $(\lambda, \nu)$ provides a
proof or certificate that $p^\star \ge g(\lambda, \nu)$. Strong duality means there exist arbitrarily
good certificates.

Dual feasible points allow us to bound how suboptimal a given feasible point
is, without knowing the exact value of $p^\star$. Indeed, if $x$ is primal feasible and $(\lambda, \nu)$
is dual feasible, then

$$ f_0(x) - p^\star \le f_0(x) - g(\lambda, \nu). $$

In particular, this establishes that $x$ is $\epsilon$-suboptimal, with $\epsilon = f_0(x) - g(\lambda, \nu)$. (It
also establishes that $(\lambda, \nu)$ is $\epsilon$-suboptimal for the dual problem.)

We refer to the gap between primal and dual objectives,

$$ f_0(x) - g(\lambda, \nu), $$

as the duality gap associated with the primal feasible point $x$ and dual feasible
point $(\lambda, \nu)$. A primal dual feasible pair $x$, $(\lambda, \nu)$ localizes the optimal value of the
primal (and dual) problems to an interval:

$$ p^\star \in [g(\lambda, \nu), f_0(x)], \qquad d^\star \in [g(\lambda, \nu), f_0(x)], $$

the width of which is the duality gap.

If the duality gap of the primal dual feasible pair $x$, $(\lambda, \nu)$ is zero, i.e., $f_0(x) =
g(\lambda, \nu)$, then $x$ is primal optimal and $(\lambda, \nu)$ is dual optimal. We can think of $(\lambda, \nu)$
as a certificate that proves $x$ is optimal (and, similarly, we can think of $x$ as a
certificate that proves $(\lambda, \nu)$ is dual optimal).

These observations can be used in optimization algorithms to provide nonheuristic
stopping criteria. Suppose an algorithm produces a sequence of primal feasible
$x^{(k)}$ and dual feasible $(\lambda^{(k)}, \nu^{(k)})$, for $k = 1, 2, \ldots$, and $\epsilon_{\mathrm{abs}} > 0$ is a given required
absolute accuracy. Then the stopping criterion (i.e., the condition for terminating
the algorithm)

$$ f_0(x^{(k)}) - g(\lambda^{(k)}, \nu^{(k)}) \le \epsilon_{\mathrm{abs}} $$

guarantees that when the algorithm terminates, $x^{(k)}$ is $\epsilon_{\mathrm{abs}}$-suboptimal. Indeed,
$(\lambda^{(k)}, \nu^{(k)})$ is a certificate that proves it. (Of course strong duality must hold if
this method is to work for arbitrarily small tolerances $\epsilon_{\mathrm{abs}}$.)

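A similar condition can be used to guarantee a given relative accuracy $\epsilon_{\mathrm{rel}} > 0$.
If

$$ g(\lambda^{(k)}, \nu^{(k)}) > 0, \qquad \frac{f_0(x^{(k)}) - g(\lambda^{(k)}, \nu^{(k)})}{g(\lambda^{(k)}, \nu^{(k)})} \le \epsilon_{\mathrm{rel}} $$

holds, or

$$ f_0(x^{(k)}) < 0, \qquad \frac{f_0(x^{(k)}) - g(\lambda^{(k)}, \nu^{(k)})}{-f_0(x^{(k)})} \le \epsilon_{\mathrm{rel}} $$

holds, then $p^\star \ne 0$ and the relative error

$$ \frac{f_0(x^{(k)}) - p^\star}{|p^\star|} $$

is guaranteed to be less than or equal to $\epsilon_{\mathrm{rel}}$.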
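
The absolute and relative tests above can be sketched in a few lines of Python. This is only an illustration: the function name `gap_stopping_test` and its arguments are hypothetical, and the caller is assumed to supply the objective value at a primal feasible point and the dual function value at a dual feasible point.

```python
def gap_stopping_test(primal_value, dual_value, eps_abs=None, eps_rel=None):
    """Return True if the duality gap certifies the requested accuracy.

    primal_value: f0(x_k) at a primal feasible point (assumed supplied by the caller).
    dual_value:   g(lambda_k, nu_k) at a dual feasible point (assumed supplied).
    """
    gap = primal_value - dual_value  # nonnegative for a feasible pair
    if eps_abs is not None and gap <= eps_abs:
        return True
    if eps_rel is not None:
        # Case 1: the lower bound is positive, so p* >= g > 0.
        if dual_value > 0 and gap / dual_value <= eps_rel:
            return True
        # Case 2: the upper bound is negative, so p* <= f0 < 0.
        if primal_value < 0 and gap / (-primal_value) <= eps_rel:
            return True
    return False


print(gap_stopping_test(1.05, 1.0, eps_rel=0.1))   # gap 0.05 relative to g = 1
print(gap_stopping_test(1.05, 1.0, eps_abs=0.01))  # gap 0.05 is too large here
```

When neither sign condition holds (for example a negative lower bound with a positive upper bound), the sketch returns False for the relative test, since the pair does not yet certify that $p^\star \ne 0$.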


5.5.2  Complementary  slackness 


Suppose that the primal and dual optimal values are attained and equal (so, in
particular, strong duality holds). Let $x^\star$ be a primal optimal and $(\lambda^\star, \nu^\star)$ be a dual
optimal point. This means that

$$
\begin{aligned}
f_0(x^\star) &= g(\lambda^\star, \nu^\star) \\
&= \inf_x \Big( f_0(x) + \sum_{i=1}^m \lambda_i^\star f_i(x) + \sum_{i=1}^p \nu_i^\star h_i(x) \Big) \\
&\le f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star f_i(x^\star) + \sum_{i=1}^p \nu_i^\star h_i(x^\star) \\
&\le f_0(x^\star).
\end{aligned}
$$

The first line states that the optimal duality gap is zero, and the second line is
the definition of the dual function. The third line follows since the infimum of the
Lagrangian over $x$ is less than or equal to its value at $x = x^\star$. The last inequality
follows from $\lambda_i^\star \ge 0$, $f_i(x^\star) \le 0$, $i = 1, \ldots, m$, and $h_i(x^\star) = 0$, $i = 1, \ldots, p$. We
conclude that the two inequalities in this chain hold with equality.

We can draw several interesting conclusions from this. For example, since the
inequality in the third line is an equality, we conclude that $x^\star$ minimizes $L(x, \lambda^\star, \nu^\star)$
over $x$. (The Lagrangian $L(x, \lambda^\star, \nu^\star)$ can have other minimizers; $x^\star$ is simply a
minimizer.)

Another important conclusion is that

$$ \sum_{i=1}^m \lambda_i^\star f_i(x^\star) = 0. $$

Since each term in this sum is nonpositive, we conclude that

$$ \lambda_i^\star f_i(x^\star) = 0, \quad i = 1, \ldots, m. \qquad (5.48) $$


This condition is known as complementary slackness; it holds for any primal optimal
$x^\star$ and any dual optimal $(\lambda^\star, \nu^\star)$ (when strong duality holds). We can express
the complementary slackness condition as

$$ \lambda_i^\star > 0 \implies f_i(x^\star) = 0, $$

or, equivalently,

$$ f_i(x^\star) < 0 \implies \lambda_i^\star = 0. $$

Roughly speaking, this means the $i$th optimal Lagrange multiplier is zero unless
the $i$th constraint is active at the optimum.


5.5.3  KKT  optimality  conditions 

We now assume that the functions $f_0, \ldots, f_m, h_1, \ldots, h_p$ are differentiable (and
therefore have open domains), but we make no assumptions yet about convexity.

KKT  conditions  for  nonconvex  problems 

As above, let $x^\star$ and $(\lambda^\star, \nu^\star)$ be any primal and dual optimal points with zero
duality gap. Since $x^\star$ minimizes $L(x, \lambda^\star, \nu^\star)$ over $x$, it follows that its gradient
must vanish at $x^\star$, i.e.,

$$ \nabla f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star \nabla f_i(x^\star) + \sum_{i=1}^p \nu_i^\star \nabla h_i(x^\star) = 0. $$

Thus we have

$$
\begin{aligned}
f_i(x^\star) &\le 0, & i &= 1, \ldots, m \\
h_i(x^\star) &= 0, & i &= 1, \ldots, p \\
\lambda_i^\star &\ge 0, & i &= 1, \ldots, m \\
\lambda_i^\star f_i(x^\star) &= 0, & i &= 1, \ldots, m \\
\nabla f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star \nabla f_i(x^\star) + \sum_{i=1}^p \nu_i^\star \nabla h_i(x^\star) &= 0,
\end{aligned}
\qquad (5.49)
$$

which are called the Karush-Kuhn-Tucker (KKT) conditions.

To summarize, for any optimization problem with differentiable objective and
constraint functions for which strong duality obtains, any pair of primal and dual
optimal points must satisfy the KKT conditions (5.49).

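
KKT conditions for convex problems

When the primal problem is convex, the KKT conditions are also sufficient for the
points to be primal and dual optimal. In other words, if $f_i$ are convex and $h_i$ are
affine, and $\tilde{x}$, $\tilde{\lambda}$, $\tilde{\nu}$ are any points that satisfy the KKT conditions

$$
\begin{aligned}
f_i(\tilde{x}) &\le 0, & i &= 1, \ldots, m \\
h_i(\tilde{x}) &= 0, & i &= 1, \ldots, p \\
\tilde{\lambda}_i &\ge 0, & i &= 1, \ldots, m \\
\tilde{\lambda}_i f_i(\tilde{x}) &= 0, & i &= 1, \ldots, m \\
\nabla f_0(\tilde{x}) + \sum_{i=1}^m \tilde{\lambda}_i \nabla f_i(\tilde{x}) + \sum_{i=1}^p \tilde{\nu}_i \nabla h_i(\tilde{x}) &= 0,
\end{aligned}
$$

then $\tilde{x}$ and $(\tilde{\lambda}, \tilde{\nu})$ are primal and dual optimal, with zero duality gap.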

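To see this, note that the first two conditions state that $\tilde{x}$ is primal feasible.
Since $\tilde{\lambda}_i \ge 0$, $L(x, \tilde{\lambda}, \tilde{\nu})$ is convex in $x$; the last KKT condition states that its
gradient with respect to $x$ vanishes at $x = \tilde{x}$, so it follows that $\tilde{x}$ minimizes $L(x, \tilde{\lambda}, \tilde{\nu})$
over $x$. From this we conclude that

$$
\begin{aligned}
g(\tilde{\lambda}, \tilde{\nu}) &= L(\tilde{x}, \tilde{\lambda}, \tilde{\nu}) \\
&= f_0(\tilde{x}) + \sum_{i=1}^m \tilde{\lambda}_i f_i(\tilde{x}) + \sum_{i=1}^p \tilde{\nu}_i h_i(\tilde{x}) \\
&= f_0(\tilde{x}),
\end{aligned}
$$

where in the last line we use $h_i(\tilde{x}) = 0$ and $\tilde{\lambda}_i f_i(\tilde{x}) = 0$. This shows that $\tilde{x}$
and $(\tilde{\lambda}, \tilde{\nu})$ have zero duality gap, and therefore are primal and dual optimal. In
summary, for any convex optimization problem with differentiable objective and
constraint functions, any points that satisfy the KKT conditions are primal and
dual optimal, and have zero duality gap.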

If a convex optimization problem with differentiable objective and constraint
functions satisfies Slater's condition, then the KKT conditions provide necessary
and sufficient conditions for optimality: Slater's condition implies that the optimal
duality gap is zero and the dual optimum is attained, so $x$ is optimal if and only if
there are $(\lambda, \nu)$ that, together with $x$, satisfy the KKT conditions.

The  KKT  conditions  play  an  important  role  in  optimization.  In  a  few  special 
cases  it  is  possible  to  solve  the  KKT  conditions  (and  therefore,  the  optimization 
problem)  analytically.  More  generally,  many  algorithms  for  convex  optimization  are 
conceived  as,  or  can  be  interpreted  as,  methods  for  solving  the  KKT  conditions. 
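
As a small illustration of using the KKT conditions as an optimality test, the sketch below checks them numerically for one toy convex problem, minimize $x_1^2 + x_2^2$ subject to $1 - x_1 - x_2 \le 0$. The problem, the helper names `kkt_residuals` and `satisfies_kkt`, and the tolerance are assumptions chosen for this example; they do not come from the text.

```python
def kkt_residuals(x, lam):
    """KKT residuals for: minimize x1^2 + x2^2 subject to 1 - x1 - x2 <= 0."""
    f1 = 1.0 - x[0] - x[1]                  # inequality constraint value
    grad_f0 = (2.0 * x[0], 2.0 * x[1])      # gradient of the objective
    grad_f1 = (-1.0, -1.0)                  # gradient of the constraint
    stationarity = tuple(g0 + lam * g1 for g0, g1 in zip(grad_f0, grad_f1))
    return {
        "primal_violation": max(f1, 0.0),   # zero when f1(x) <= 0
        "dual_violation": max(-lam, 0.0),   # zero when lam >= 0
        "comp_slack": abs(lam * f1),        # zero when lam * f1(x) = 0
        "stationarity": max(abs(s) for s in stationarity),
    }


def satisfies_kkt(x, lam, tol=1e-9):
    """True when every KKT residual is below the (hypothetical) tolerance."""
    return all(r <= tol for r in kkt_residuals(x, lam).values())


print(satisfies_kkt((0.5, 0.5), 1.0))   # candidate pair worked out by hand
print(satisfies_kkt((1.0, 0.0), 1.0))   # feasible, but the gradient does not vanish
```

Because this toy problem is convex, a pair that passes the check is primal and dual optimal by the sufficiency result above; a failed check for one multiplier says nothing about whether another multiplier would work.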


Example 5.1 Equality constrained convex quadratic minimization. We consider the
problem

$$
\begin{array}{ll}
\mbox{minimize} & (1/2) x^T P x + q^T x + r \\
\mbox{subject to} & Ax = b,
\end{array}
\qquad (5.50)
$$

where $P \in \mathbf{S}^n_+$. The KKT conditions for this problem are

$$ Ax^\star = b, \qquad Px^\star + q + A^T \nu^\star = 0, $$

which we can write as

$$
\begin{bmatrix} P & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} x^\star \\ \nu^\star \end{bmatrix}
=
\begin{bmatrix} -q \\ b \end{bmatrix}.
$$

Solving this set of $m + n$ equations in the $m + n$ variables $x^\star$, $\nu^\star$ gives the optimal
primal and dual variables for (5.50).
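
A minimal sketch of this computation, in plain Python, is given below. It assembles the block system with rows $(P, A^T)$ and $(A, 0)$ and solves it by Gaussian elimination. The helper names `solve_linear` and `solve_eq_qp` are hypothetical, and the sketch assumes the KKT matrix is nonsingular, which the example itself does not guarantee.

```python
def solve_linear(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("KKT matrix is singular (assumption violated)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):          # back substitution
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z


def solve_eq_qp(P, q, A, b):
    """Return (x, nu) solving the KKT system of the equality constrained QP."""
    n, m = len(q), len(b)
    # Rows [P, A^T] followed by rows [A, 0]; right-hand side is [-q, b].
    K = [list(P[i]) + [A[j][i] for j in range(m)] for i in range(n)]
    K += [list(A[j]) + [0.0] * m for j in range(m)]
    z = solve_linear(K, [-qi for qi in q] + list(b))
    return z[:n], z[n:]


# Toy instance: minimize (1/2)(x1^2 + x2^2) subject to x1 + x2 = 1.
x, nu = solve_eq_qp([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[1.0, 1.0]], [1.0])
print(x, nu)
```

For larger problems a library factorization would normally replace the hand-written elimination; it is spelled out here only to keep the sketch self-contained.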


Example 5.2 Water-filling. We consider the convex optimization problem

$$
\begin{array}{ll}
\mbox{minimize} & -\sum_{i=1}^n \log(\alpha_i + x_i) \\
\mbox{subject to} & x \succeq 0, \quad \mathbf{1}^T x = 1,
\end{array}
$$

where $\alpha_i > 0$. This problem arises in information theory, in allocating power to a
set of $n$ communication channels. The variable $x_i$ represents the transmitter power
allocated to the $i$th channel, and $\log(\alpha_i + x_i)$ gives the capacity or communication
rate of the channel, so the problem is to allocate a total power of one to the channels,
in order to maximize the total communication rate.

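Introducing Lagrange multipliers $\lambda^\star \in \mathbf{R}^n$ for the inequality constraints $x^\star \succeq 0$,
and a multiplier $\nu^\star \in \mathbf{R}$ for the equality constraint $\mathbf{1}^T x = 1$, we obtain the KKT
conditions

$$ x^\star \succeq 0, \quad \mathbf{1}^T x^\star = 1, \quad \lambda^\star \succeq 0, \quad \lambda_i^\star x_i^\star = 0, \quad i = 1, \ldots, n, $$

$$ -1/(\alpha_i + x_i^\star) - \lambda_i^\star + \nu^\star = 0, \quad i = 1, \ldots, n. $$

We can directly solve these equations to find $x^\star$, $\lambda^\star$, and $\nu^\star$. We start by noting that
$\lambda^\star$ acts as a slack variable in the last equation, so it can be eliminated, leaving

$$ x^\star \succeq 0, \quad \mathbf{1}^T x^\star = 1, \quad x_i^\star \left( \nu^\star - 1/(\alpha_i + x_i^\star) \right) = 0, \quad i = 1, \ldots, n, $$

$$ \nu^\star \ge 1/(\alpha_i + x_i^\star), \quad i = 1, \ldots, n. $$

If $\nu^\star < 1/\alpha_i$, this last condition can only hold if $x_i^\star > 0$, which by the third condition
implies that $\nu^\star = 1/(\alpha_i + x_i^\star)$. Solving for $x_i^\star$, we conclude that $x_i^\star = 1/\nu^\star - \alpha_i$
if $\nu^\star < 1/\alpha_i$. If $\nu^\star \ge 1/\alpha_i$, then $x_i^\star > 0$ is impossible, because it would imply
$\nu^\star \ge 1/\alpha_i > 1/(\alpha_i + x_i^\star)$, which violates the complementary slackness condition.
Therefore, $x_i^\star = 0$ if $\nu^\star \ge 1/\alpha_i$. Thus we have

$$
x_i^\star = \begin{cases} 1/\nu^\star - \alpha_i & \nu^\star < 1/\alpha_i \\ 0 & \nu^\star \ge 1/\alpha_i, \end{cases}
$$

or, put more simply, $x_i^\star = \max\{0, 1/\nu^\star - \alpha_i\}$. Substituting this expression for $x_i^\star$
into the condition $\mathbf{1}^T x^\star = 1$ we obtain

$$ \sum_{i=1}^n \max\{0, 1/\nu^\star - \alpha_i\} = 1. $$

The lefthand side is a piecewise-linear increasing function of $1/\nu^\star$, with breakpoints
at $\alpha_i$, so the equation has a unique solution which is readily determined.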

This solution method is called water-filling for the following reason. We think of
$\alpha_i$ as the ground level above patch $i$, and then flood the region with water to a
depth $1/\nu$, as illustrated in figure 5.7. The total amount of water used is then
$\sum_{i=1}^n \max\{0, 1/\nu^\star - \alpha_i\}$. We then increase the flood level until we have used a total
amount of water equal to one. The depth of water above patch $i$ is then the optimal
value $x_i^\star$.
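
The water level can be found numerically by bisection, since the amount of water used is an increasing function of the level. The sketch below is one way to do it, not the only one (sorting the $\alpha_i$ and scanning the breakpoints would also work). The name `water_fill`, the fixed iteration count, and the test data are assumptions made for illustration.

```python
def water_fill(alpha, total=1.0, iterations=100):
    """Return (x, nu) with x_i = max(0, 1/nu - alpha_i) and sum(x) = total.

    alpha: positive ground levels; total: amount of water (one in the example).
    """
    def used(level):
        # Water needed to flood every patch up to the given level.
        return sum(max(0.0, level - a) for a in alpha)

    lo = min(alpha)            # no water is used at this level
    hi = min(alpha) + total    # the lowest patch alone already absorbs `total`
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if used(mid) < total:
            lo = mid
        else:
            hi = mid
    level = 0.5 * (lo + hi)    # this is 1/nu at the solution
    x = [max(0.0, level - a) for a in alpha]
    return x, 1.0 / level


x, nu = water_fill([0.2, 0.5, 1.5])
print(x, nu)
```

In this instance the third patch lies above the final water level, so it receives no power, which matches the case $\nu^\star \ge 1/\alpha_i$ in the derivation.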

Figure 5.7 Illustration of water-filling algorithm. The height of each patch is
given by $\alpha_i$. The region is flooded to a level $1/\nu^\star$ which uses a total quantity
of water equal to one. The height of the water (shown shaded) above each
patch is the optimal value of $x_i^\star$.


Figure 5.8 Two blocks connected by springs to each other, and the left and
right walls. The blocks have width $w > 0$, and cannot penetrate each other
or the walls.


5.5.4  Mechanics  interpretation  of  KKT  conditions 

The KKT conditions can be given a nice interpretation in mechanics (which indeed,
was one of Lagrange's primary motivations). We illustrate the idea with a simple
example. The system shown in figure 5.8 consists of two blocks attached to each
other, and to walls at the left and right, by three springs. The positions of the
blocks are given by $x \in \mathbf{R}^2$, where $x_1$ is the displacement of the (middle of the) left
block, and $x_2$ is the displacement of the right block. The left wall is at position 0,
and the right wall is at position $l$.

The potential energy in the springs, as a function of the block positions, is given
by

$$ f_0(x_1, x_2) = \frac{1}{2} k_1 x_1^2 + \frac{1}{2} k_2 (x_2 - x_1)^2 + \frac{1}{2} k_3 (l - x_2)^2, $$

where $k_i > 0$ are the stiffness constants of the three springs. The equilibrium
position $x^\star$ is the position that minimizes the potential energy subject to the
inequalities

$$ w/2 - x_1 \le 0, \qquad w + x_1 - x_2 \le 0, \qquad w/2 - l + x_2 \le 0. \qquad (5.51) $$

Figure 5.9 Force analysis of the block-spring system. The total force on
each block, due to the springs and also to contact forces, must be zero. The
Lagrange multipliers, shown on top, are the contact forces between the walls
and blocks. The spring forces are shown at bottom.


These constraints are called kinematic constraints, and express the fact that the
blocks have width $w > 0$, and cannot penetrate each other or the walls. The
equilibrium position is therefore given by the solution of the optimization problem

$$
\begin{array}{ll}
\mbox{minimize} & (1/2) \left( k_1 x_1^2 + k_2 (x_2 - x_1)^2 + k_3 (l - x_2)^2 \right) \\
\mbox{subject to} & w/2 - x_1 \le 0 \\
& w + x_1 - x_2 \le 0 \\
& w/2 - l + x_2 \le 0,
\end{array}
\qquad (5.52)
$$

which is a QP.

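With $\lambda_1$, $\lambda_2$, $\lambda_3$ as Lagrange multipliers, the KKT conditions for this problem
consist of the kinematic constraints (5.51), the nonnegativity constraints $\lambda_i \ge 0$,
the complementary slackness conditions

$$ \lambda_1 (w/2 - x_1) = 0, \qquad \lambda_2 (w - x_2 + x_1) = 0, \qquad \lambda_3 (w/2 - l + x_2) = 0, \qquad (5.53) $$

and the zero gradient condition

$$
\begin{bmatrix} k_1 x_1 - k_2 (x_2 - x_1) \\ k_2 (x_2 - x_1) - k_3 (l - x_2) \end{bmatrix}
+ \lambda_1 \begin{bmatrix} -1 \\ 0 \end{bmatrix}
+ \lambda_2 \begin{bmatrix} 1 \\ -1 \end{bmatrix}
+ \lambda_3 \begin{bmatrix} 0 \\ 1 \end{bmatrix}
= 0. \qquad (5.54)
$$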


The equation (5.54) can be interpreted as the force balance equations for the two
blocks, provided we interpret the Lagrange multipliers as contact forces that act
between the walls and blocks, as illustrated in figure 5.9. The first equation states
that the sum of the forces on the first block is zero: The term $-k_1 x_1$ is the force
exerted on the left block by the left spring, the term $k_2 (x_2 - x_1)$ is the force exerted
by the middle spring, $\lambda_1$ is the force exerted by the left wall, and $-\lambda_2$ is the force
exerted by the right block. The contact forces must point away from the contact
surface (as expressed by the constraints $\lambda_1 \ge 0$ and $-\lambda_2 \le 0$), and are nonzero
only when there is contact (as expressed by the first two complementary slackness
conditions (5.53)). In a similar way, the second equation in (5.54) is the force
balance for the second block, and the last condition in (5.53) states that $\lambda_3$ is zero
unless the right block touches the wall.

In this example, the potential energy and kinematic constraint functions are
convex, and (the refined form of) Slater's constraint qualification holds provided
$2w \le l$, i.e., there is enough room between the walls to fit the two blocks, so we
can conclude that the energy formulation of the equilibrium given by (5.52) gives
the same result as the force balance formulation, given by the KKT conditions.
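
To make the force balance reading concrete, the sketch below evaluates the residuals of (5.51), (5.53), and (5.54) at a candidate position and set of contact forces. The function name `block_spring_kkt` and the numerical instance (unit stiffnesses, $l = 1$, $w = 0.4$) are assumptions chosen for illustration; the candidate was worked out by hand for the case where the two blocks touch each other but not the walls.

```python
def block_spring_kkt(x, lam, k, w, l):
    """Residuals of the KKT conditions (5.51), (5.53), (5.54) for the two blocks."""
    x1, x2 = x
    lam1, lam2, lam3 = lam
    k1, k2, k3 = k
    # Kinematic constraints (5.51); each value must be <= 0.
    constraints = (w / 2 - x1, w + x1 - x2, w / 2 - l + x2)
    # Rows of the zero gradient condition (5.54).
    force_1 = k1 * x1 - k2 * (x2 - x1) - lam1 + lam2
    force_2 = k2 * (x2 - x1) - k3 * (l - x2) - lam2 + lam3
    return {
        "kinematic": max(max(c, 0.0) for c in constraints),
        "nonnegativity": max(max(-m, 0.0) for m in lam),
        "comp_slack": max(abs(m * c) for m, c in zip(lam, constraints)),
        "force_balance": max(abs(force_1), abs(force_2)),
    }


# Blocks in contact with each other (lam2 > 0) but not with either wall.
residuals = block_spring_kkt((0.3, 0.7), (0.0, 0.1, 0.0), (1.0, 1.0, 1.0), 0.4, 1.0)
print(residuals)
```

All four residuals are zero up to rounding for this candidate, so by convexity it is the equilibrium. Setting $\lambda_2 = 0$ at the same position leaves an unbalanced force on each block, which is the contact force the multiplier accounts for.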

5.5.5 Solving the primal problem via the dual

We mentioned at the beginning of §5.5.3 that if strong duality holds and a dual
optimal solution $(\lambda^\star, \nu^\star)$ exists, then any primal optimal point is also a minimizer
of $L(x, \lambda^\star, \nu^\star)$. This fact sometimes allows us to compute a primal optimal solution
from a dual optimal solution.

More precisely, suppose we have strong duality and an optimal $(\lambda^\star, \nu^\star)$ is known.
Suppose that the minimizer of $L(x, \lambda^\star, \nu^\star)$, i.e., the solution of

$$ \mbox{minimize} \quad f_0(x) + \sum_{i=1}^m \lambda_i^\star f_i(x) + \sum_{i=1}^p \nu_i^\star h_i(x), \qquad (5.55) $$

is unique. (For a convex problem this occurs, for example, if $L(x, \lambda^\star, \nu^\star)$ is a strictly
convex function of $x$.) Then if the solution of (5.55) is primal feasible, it must be
primal optimal; if it is not primal feasible, then no primal optimal point can exist,
i.e., we can conclude that the primal optimum is not attained. This observation is
interesting when the dual problem is easier to solve than the primal problem, for
example, because it can be solved analytically, or has some special structure that
can be exploited.


Example 5.3 Entropy maximization. We consider the entropy maximization problem

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) = \sum_{i=1}^n x_i \log x_i \\
\mbox{subject to} & Ax \preceq b \\
& \mathbf{1}^T x = 1
\end{array}
$$

with domain $\mathbf{R}^n_{++}$, and its dual problem

$$
\begin{array}{ll}
\mbox{maximize} & -b^T \lambda - \nu - e^{-\nu - 1} \sum_{i=1}^n e^{-a_i^T \lambda} \\
\mbox{subject to} & \lambda \succeq 0
\end{array}
$$

where $a_i$ are the columns of $A$ (see pages 222 and 228). We assume that the weak
form of Slater's condition holds, i.e., there exists an $x \succ 0$ with $Ax \preceq b$ and $\mathbf{1}^T x = 1$,
so strong duality holds and an optimal solution exists.

Suppose we have solved the dual problem. The Lagrangian at $(\lambda^\star, \nu^\star)$ is

$$ L(x, \lambda^\star, \nu^\star) = \sum_{i=1}^n x_i \log x_i + \lambda^{\star T} (Ax - b) + \nu^\star (\mathbf{1}^T x - 1) $$

which is strictly convex on $\mathcal{D}$ and bounded below, so it has a unique solution $x^\star$,
given by

$$ x_i^\star = 1/\exp(a_i^T \lambda^\star + \nu^\star + 1), \quad i = 1, \ldots, n. $$

If $x^\star$ is primal feasible, it must be the optimal solution of the primal problem (5.13).
If $x^\star$ is not primal feasible, then we can conclude that the primal optimum is not
attained.


Example 5.4 Minimizing a separable function subject to an equality constraint. We
consider the problem

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) = \sum_{i=1}^n f_i(x_i) \\
\mbox{subject to} & a^T x = b,
\end{array}
$$

where $a \in \mathbf{R}^n$, $b \in \mathbf{R}$, and $f_i : \mathbf{R} \to \mathbf{R}$ are differentiable and strictly convex. The
objective function is called separable since it is a sum of functions of the individual
variables $x_1, \ldots, x_n$. We assume that the domain of $f_0$ intersects the constraint set,
i.e., there exists a point $x_0 \in \mathbf{dom}\, f_0$ with $a^T x_0 = b$. This implies the problem has
a unique optimal point $x^\star$.

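The Lagrangian is

$$ L(x, \nu) = \sum_{i=1}^n f_i(x_i) + \nu (a^T x - b) = -b\nu + \sum_{i=1}^n \left( f_i(x_i) + \nu a_i x_i \right), $$

which is also separable, so the dual function is

$$
\begin{aligned}
g(\nu) &= -b\nu + \sum_{i=1}^n \inf_{x_i} \left( f_i(x_i) + \nu a_i x_i \right) \\
&= -b\nu - \sum_{i=1}^n f_i^*(-\nu a_i).
\end{aligned}
$$

The dual problem is thus

$$ \mbox{maximize} \quad -b\nu - \sum_{i=1}^n f_i^*(-\nu a_i), $$

with (scalar) variable $\nu \in \mathbf{R}$.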

Now suppose we have found an optimal dual variable $\nu^\star$. (There are several simple
methods for solving a convex problem with one scalar variable, such as the bisection
method.) Since each $f_i$ is strictly convex, the function $L(x, \nu^\star)$ is strictly convex in
$x$, and so has a unique minimizer $\tilde{x}$. But we also know that $x^\star$ minimizes $L(x, \nu^\star)$,
so we must have $\tilde{x} = x^\star$. We can recover $x^\star$ from $\nabla_x L(x, \nu^\star) = 0$, i.e., by solving the
equations $f_i'(x_i^\star) = -\nu^\star a_i$.
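
The procedure in this example can be sketched for one concrete choice, $f_i(x_i) = (1/2) c_i x_i^2$ with $c_i > 0$, which is an assumption made here so that the equations $f_i'(x_i) = -\nu a_i$ have the closed-form solution $x_i = -\nu a_i / c_i$. The dual is maximized by bisection on its derivative $g'(\nu) = a^T x(\nu) - b$, where $x(\nu)$ minimizes $L(x, \nu)$. The name `solve_separable` and the bracketing interval are hypothetical.

```python
def solve_separable(a, b, c, lo=-1e6, hi=1e6, iterations=200):
    """Minimize sum_i 0.5*c_i*x_i^2 subject to a^T x = b via the scalar dual.

    Assumes c_i > 0 and that the dual optimum lies in [lo, hi].
    """
    def x_of(nu):
        # Minimizer of L(x, nu): solve f_i'(x_i) = c_i * x_i = -nu * a_i.
        return [-nu * ai / ci for ai, ci in zip(a, c)]

    def dual_slope(nu):
        # g'(nu) = a^T x(nu) - b; decreasing in nu because g is concave.
        return sum(ai * xi for ai, xi in zip(a, x_of(nu))) - b

    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if dual_slope(mid) > 0:
            lo = mid      # g is still increasing: the maximizer lies to the right
        else:
            hi = mid
    nu_star = 0.5 * (lo + hi)
    return x_of(nu_star), nu_star


x, nu = solve_separable(a=[1.0, 2.0], b=2.0, c=[1.0, 4.0])
print(x, nu)
```

For this quadratic choice the dual optimum is also available in closed form, $\nu^\star = -b / \sum_i a_i^2 / c_i$, which gives a quick check on the bisection; for general strictly convex $f_i$ only the line inside `x_of` would change.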


5.6  Perturbation  and  sensitivity  analysis 

When strong duality obtains, the optimal dual variables give very useful information
about the sensitivity of the optimal value with respect to perturbations of the
constraints.


5.6.1  The  perturbed  problem 

We consider the following perturbed version of the original optimization
problem (5.1):

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \le u_i, \quad i = 1, \ldots, m \\
& h_i(x) = v_i, \quad i = 1, \ldots, p,
\end{array}
\qquad (5.56)
$$

with variable $x \in \mathbf{R}^n$. This problem coincides with the original problem (5.1) when
$u = 0$, $v = 0$. When $u_i$ is positive it means that we have relaxed the $i$th inequality
constraint; when $u_i$ is negative, it means that we have tightened the constraint.
Thus the perturbed problem (5.56) results from the original problem (5.1) by
tightening or relaxing each inequality constraint by $u_i$, and changing the righthand side
of the equality constraints by $v_i$.

We define $p^\star(u, v)$ as the optimal value of the perturbed problem (5.56):

$$ p^\star(u, v) = \inf \{ f_0(x) \mid \exists x \in \mathcal{D}, \; f_i(x) \le u_i, \; i = 1, \ldots, m, \; h_i(x) = v_i, \; i = 1, \ldots, p \}. $$

We can have $p^\star(u, v) = \infty$, which corresponds to perturbations of the constraints
that result in infeasibility. Note that $p^\star(0, 0) = p^\star$, the optimal value of the
unperturbed problem (5.1). (We hope this slight abuse of notation will cause no
confusion.) Roughly speaking, the function $p^\star : \mathbf{R}^m \times \mathbf{R}^p \to \mathbf{R}$ gives the optimal
value of the problem as a function of perturbations to the righthand sides of the
constraints.

When the original problem is convex, the function $p^\star$ is a convex function of $u$
and $v$; indeed, its epigraph is precisely the closure of the set $\mathcal{A}$ defined in (5.37)
(see exercise 5.32).


5.6.2  A  global  inequality 

Now we assume that strong duality holds, and that the dual optimum is attained.
(This is the case if the original problem is convex, and Slater's condition is satisfied.)
Let $(\lambda^\star, \nu^\star)$ be optimal for the dual (5.16) of the unperturbed problem. Then for
all $u$ and $v$ we have

$$ p^\star(u, v) \ge p^\star(0, 0) - \lambda^{\star T} u - \nu^{\star T} v. \qquad (5.57) $$

To establish this inequality, suppose that $x$ is any feasible point for the perturbed
problem, i.e., $f_i(x) \le u_i$ for $i = 1, \ldots, m$, and $h_i(x) = v_i$ for $i = 1, \ldots, p$.
Then we have, by strong duality,

$$
\begin{aligned}
p^\star(0, 0) = g(\lambda^\star, \nu^\star) &\le f_0(x) + \sum_{i=1}^m \lambda_i^\star f_i(x) + \sum_{i=1}^p \nu_i^\star h_i(x) \\
&\le f_0(x) + \lambda^{\star T} u + \nu^{\star T} v.
\end{aligned}
$$

(The first inequality follows from the definition of $g(\lambda^\star, \nu^\star)$; the second follows
since $\lambda^\star \succeq 0$.) We conclude that for any $x$ feasible for the perturbed problem, we
have

$$ f_0(x) \ge p^\star(0, 0) - \lambda^{\star T} u - \nu^{\star T} v, $$

from which (5.57) follows.

Sensitivity  interpretations 

When strong duality holds, various sensitivity interpretations of the optimal Lagrange
variables follow directly from the inequality (5.57). Some of the conclusions
are:

•  If $\lambda_i^\star$ is large and we tighten the $i$th constraint (i.e., choose $u_i < 0$), then the
optimal value $p^\star(u, v)$ is guaranteed to increase greatly.

•  If $\nu_i^\star$ is large and positive and we take $v_i < 0$, or if $\nu_i^\star$ is large and negative
and we take $v_i > 0$, then the optimal value $p^\star(u, v)$ is guaranteed to increase
greatly.

•  If $\lambda_i^\star$ is small, and we loosen the $i$th constraint ($u_i > 0$), then the optimal
value $p^\star(u, v)$ will not decrease too much.

•  If $\nu_i^\star$ is small and positive, and $v_i > 0$, or if $\nu_i^\star$ is small and negative and
$v_i < 0$, then the optimal value $p^\star(u, v)$ will not decrease too much.

Figure 5.10 Optimal value $p^\star(u)$ of a convex problem with one constraint
$f_1(x) \le u$, as a function of $u$. For $u = 0$, we have the original unperturbed
problem; for $u < 0$ the constraint is tightened, and for $u > 0$ the constraint
is loosened. The affine function $p^\star(0) - \lambda^\star u$ is a lower bound on $p^\star$.

The inequality (5.57), and the conclusions listed above, give a lower bound on
the perturbed optimal value, but no upper bound. For this reason the results are
not symmetric with respect to loosening or tightening a constraint. For example,
suppose that $\lambda_i^\star$ is large, and we loosen the $i$th constraint a bit (i.e., take $u_i$ small
and positive). In this case the inequality (5.57) is not useful; it does not, for
example, imply that the optimal value will decrease considerably.

The inequality (5.57) is illustrated in figure 5.10 for a convex problem with one
inequality constraint. The inequality states that the affine function $p^\star(0) - \lambda^\star u$ is
a lower bound on the convex function $p^\star$.
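
The global inequality can be checked numerically on a one-constraint toy problem, minimize $x^2$ subject to $1 - x \le u$. This problem, and the names `p_star`, `lam_star`, and `lower_bound`, are assumptions introduced for illustration. At $u = 0$ the KKT conditions give $x^\star = 1$ and $\lambda^\star = 2$, and the perturbed optimal value is available in closed form.

```python
def p_star(u):
    """Optimal value of: minimize x^2 subject to 1 - x <= u."""
    # The constraint reads x >= 1 - u; the unconstrained minimizer is x = 0.
    x_opt = max(1.0 - u, 0.0)
    return x_opt ** 2


lam_star = 2.0   # from the KKT conditions at u = 0: 2*x - lam = 0 with x = 1


def lower_bound(u):
    """Affine bound p*(0) - lam* u given by (5.57)."""
    return p_star(0.0) - lam_star * u


for u in (-0.5, -0.1, 0.0, 0.1, 0.5, 2.0):
    print(u, p_star(u), lower_bound(u))
```

The bound holds for every $u$ in the sweep and is tight only at $u = 0$. For large positive $u$ the bound becomes very negative while $p^\star(u)$ stays at zero, which is the asymmetry described above: the inequality says little about how much the optimal value actually decreases when a constraint is loosened.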


5.6.3  Local  sensitivity  analysis 

Suppose now that $p^\star(u, v)$ is differentiable at $u = 0$, $v = 0$. Then, provided strong
duality holds, the optimal dual variables $\lambda^\star$, $\nu^\star$ are related to the gradient of $p^\star$ at
$u = 0$, $v = 0$:

$$ \lambda_i^\star = -\frac{\partial p^\star(0, 0)}{\partial u_i}, \qquad \nu_i^\star = -\frac{\partial p^\star(0, 0)}{\partial v_i}. \qquad (5.58) $$

This property can be seen in the example shown in figure 5.10, where $-\lambda^\star$ is the
slope of $p^\star$ near $u = 0$.

Thus, when $p^\star(u, v)$ is differentiable at $u = 0$, $v = 0$, and strong duality holds,
the optimal Lagrange multipliers are exactly the local sensitivities of the optimal
value with respect to constraint perturbations. In contrast to the nondifferentiable
case, this interpretation is symmetric: Tightening the $i$th inequality constraint
a small amount (i.e., taking $u_i$ small and negative) yields an increase in $p^\star$ of
approximately $-\lambda_i^\star u_i$; loosening the $i$th constraint a small amount (i.e., taking $u_i$
small and positive) yields a decrease in $p^\star$ of approximately $\lambda_i^\star u_i$.

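To show (5.58), suppose $p^\star(u, v)$ is differentiable and strong duality holds. For
the perturbation $u = te_i$, $v = 0$, where $e_i$ is the $i$th unit vector, we have

$$ \lim_{t \to 0} \frac{p^\star(te_i, 0) - p^\star}{t} = \frac{\partial p^\star(0, 0)}{\partial u_i}. $$

The inequality (5.57) states that for $t > 0$,

$$ \frac{p^\star(te_i, 0) - p^\star}{t} \ge -\lambda_i^\star, $$

while for $t < 0$ we have the opposite inequality. Taking the limit $t \to 0$, with $t > 0$,
yields

$$ \frac{\partial p^\star(0, 0)}{\partial u_i} \ge -\lambda_i^\star, $$

while taking the limit with $t < 0$ yields the opposite inequality, so we conclude that

$$ \frac{\partial p^\star(0, 0)}{\partial u_i} = -\lambda_i^\star. $$

The same method can be used to establish

$$ \frac{\partial p^\star(0, 0)}{\partial v_i} = -\nu_i^\star. $$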

The local sensitivity result (5.58) gives us a quantitative measure of how active
a constraint is at the optimum $x^\star$. If $f_i(x^\star) < 0$, then the constraint is inactive,
and it follows that the constraint can be tightened or loosened a small amount
without affecting the optimal value. By complementary slackness, the associated
optimal Lagrange multiplier must be zero. But now suppose that $f_i(x^\star) = 0$, i.e.,
the $i$th constraint is active at the optimum. The $i$th optimal Lagrange multiplier
tells us how active the constraint is: If $\lambda_i^\star$ is small, it means that the constraint
can be loosened or tightened a bit without much effect on the optimal value; if $\lambda_i^\star$
is large, it means that if the constraint is loosened or tightened a bit, the effect on
the optimal value will be great.
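
The relation (5.58) can be checked by finite differences on a small equality constrained problem, minimize $(1/2)(x_1^2 + x_2^2)$ subject to $x_1 + x_2 = 1 + v$. This toy problem and the names below are assumptions made for illustration. By symmetry the solution is $x_1 = x_2 = (1 + v)/2$, and at $v = 0$ the stationarity condition $x_i + \nu = 0$ gives $\nu^\star = -1/2$.

```python
def p_star(v):
    """Optimal value of: minimize 0.5*(x1^2 + x2^2) subject to x1 + x2 = 1 + v."""
    x = (1.0 + v) / 2.0          # by symmetry both coordinates are equal
    return 0.5 * (x * x + x * x)


nu_star = -0.5                   # solves x_i + nu = 0 at x_i = 0.5

h = 1e-6                         # finite-difference step (an arbitrary small value)
central_difference = (p_star(h) - p_star(-h)) / (2 * h)

print(central_difference, -nu_star)
```

The two printed numbers agree up to rounding, as (5.58) predicts: increasing the righthand side of the equality constraint by a small $v$ raises the optimal value by about $-\nu^\star v = v/2$.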

Shadow price interpretation

We can also give a simple geometric interpretation of the result (5.58) in terms
of economics. We consider (for simplicity) a convex problem with no equality
constraints, which satisfies Slater's condition. The variable $x \in \mathbf{R}^m$ determines
how a firm operates, and the objective $f_0$ is the cost, i.e., $-f_0$ is the profit. Each
constraint $f_i(x) \le 0$ represents a limit on some resource such as labor, steel, or
warehouse space. The (negative) perturbed optimal cost function $-p^\star(u)$ tells us
how much more or less profit could be made if more, or less, of each resource were
made available to the firm. If it is differentiable near $u = 0$, then we have

$$ \lambda_i^\star = -\frac{\partial p^\star(0)}{\partial u_i}. $$

In other words, $\lambda_i^\star$ tells us approximately how much more profit the firm could
make, for a small increase in availability of resource $i$.

It follows that $\lambda_i^\star$ would be the natural or equilibrium price for resource $i$, if
it were possible for the firm to buy or sell it. Suppose, for example, that the firm
can buy or sell resource $i$, at a price that is less than $\lambda_i^\star$. In this case it would
certainly buy some of the resource, which would allow it to operate in a way that
increases its profit more than the cost of buying the resource. Conversely, if the
price exceeds $\lambda_i^\star$, the firm would sell some of its allocation of resource $i$, and obtain
a net gain since its income from selling some of the resource would be larger than
its drop in profit due to the reduction in availability of the resource.


5.7  Examples 

In  this  section  we  show  by  example  that  simple  equivalent  reformulations  of  a 
problem  can  lead  to  very  different  dual  problems.  We  consider  the  following  types 
of  reformulations: 

•  Introducing  new  variables  and  associated  equality  constraints. 

•  Replacing  the  objective  with  an  increasing  function  of  the  original  objective. 

•  Making  explicit  constraints  implicit,  i.e.,  incorporating  them  into  the  domain 
of  the  objective. 


5.7.1  Introducing  new  variables  and  equality  constraints 

Consider an unconstrained problem of the form

$$ \mbox{minimize} \quad f_0(Ax + b). \qquad (5.59) $$

Its Lagrange dual function is the constant $p^\star$. So while we do have strong duality,
i.e., $p^\star = d^\star$, the Lagrangian dual is neither useful nor interesting.

Now let us reformulate the problem (5.59) as

$$
\begin{array}{ll}
\mbox{minimize} & f_0(y) \\
\mbox{subject to} & Ax + b = y.
\end{array}
\qquad (5.60)
$$

Here we have introduced new variables $y$, as well as new equality constraints $Ax +
b = y$. The problems (5.59) and (5.60) are clearly equivalent.

The Lagrangian of the reformulated problem is

$$ L(x, y, \nu) = f_0(y) + \nu^T (Ax + b - y). $$

To find the dual function we minimize $L$ over $x$ and $y$. Minimizing over $x$ we find
that $g(\nu) = -\infty$ unless $A^T \nu = 0$, in which case we are left with

$$ g(\nu) = b^T \nu + \inf_y \left( f_0(y) - \nu^T y \right) = b^T \nu - f_0^*(\nu), $$

where $f_0^*$ is the conjugate of $f_0$. The dual problem of (5.60) can therefore be
expressed as

$$
\begin{array}{ll}
\mbox{maximize} & b^T \nu - f_0^*(\nu) \\
\mbox{subject to} & A^T \nu = 0.
\end{array}
\qquad (5.61)
$$

Thus, the dual of the reformulated problem (5.60) is considerably more useful than
the dual of the original problem (5.59).


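Example 5.5 Unconstrained geometric program. Consider the unconstrained geometric
program

\[
\mbox{minimize} \quad \log \left( \sum_{i=1}^m \exp(a_i^T x + b_i) \right).
\]

We first reformulate it by introducing new variables and equality constraints:

\[
\begin{array}{ll}
\mbox{minimize} & f_0(y) = \log \left( \sum_{i=1}^m \exp y_i \right) \\
\mbox{subject to} & Ax + b = y,
\end{array}
\]

where $a_i^T$ are the rows of $A$. The conjugate of the log-sum-exp function is

\[
f_0^*(\nu) = \left\{ \begin{array}{ll}
\sum_{i=1}^m \nu_i \log \nu_i & \nu \succeq 0, \;\; \mathbf{1}^T \nu = 1 \\
\infty & \mbox{otherwise}
\end{array} \right.
\]

(example 3.25, page 93), so the dual of the reformulated problem can be expressed
as

\[
\begin{array}{ll}
\mbox{maximize} & b^T \nu - \sum_{i=1}^m \nu_i \log \nu_i \\
\mbox{subject to} & \mathbf{1}^T \nu = 1 \\
& A^T \nu = 0 \\
& \nu \succeq 0,
\end{array} \tag{5.62}
\]

which is an entropy maximization problem.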
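As a small numerical illustration of example 5.5 (a sketch, not part of the original text), the following Python snippet evaluates the primal and dual objectives for a tiny instance. The data `A` and `b`, and the helper names `gp_primal` and `gp_dual`, are assumptions chosen so that the dual feasible set reduces to a single point.

```python
import math

def gp_primal(x, A, b):
    """Objective log(sum_i exp(a_i * x + b_i)) for a scalar variable x."""
    return math.log(sum(math.exp(a * x + bi) for a, bi in zip(A, b)))

def gp_dual(nu, b):
    """Dual objective b^T nu - sum_i nu_i log nu_i, using 0 log 0 = 0."""
    entropy = -sum(v * math.log(v) for v in nu if v > 0)
    return sum(bi * v for bi, v in zip(b, nu)) + entropy

# Hypothetical data: m = 2 terms, n = 1 variable, rows a_1 = 1 and a_2 = -1.
A = [1.0, -1.0]
b = [0.3, -1.1]

# nu = (1/2, 1/2) is the only point with nu >= 0, 1^T nu = 1 and A^T nu = 0.
nu = [0.5, 0.5]
d_val = gp_dual(nu, b)

# Setting the derivative of the primal objective to zero gives x = (b_2 - b_1)/2.
x_star = (b[1] - b[0]) / 2
p_val = gp_primal(x_star, A, b)

print(p_val, d_val)  # the two values agree up to rounding
```

For this instance both values equal $\log 2 + (b_1 + b_2)/2$, so the entropy maximization problem (5.62) reproduces the optimal value of the geometric program; any other $x$ gives a primal value at least as large.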


Example  5.6  Norm  approximation  problem.  We  consider  the  unconstrained  norm 
approximation  problem 

\[
\mbox{minimize} \quad \|Ax - b\|, \tag{5.63}
\]

where $\|\cdot\|$ is any norm. Here too the Lagrange dual function is constant,
equal to the optimal value of (5.63), and therefore not useful.

Once again we reformulate the problem as

\[
\begin{array}{ll}
\mbox{minimize} & \|y\| \\
\mbox{subject to} & Ax - b = y.
\end{array}
\]

The Lagrange dual problem is, following (5.61),

\[
\begin{array}{ll}
\mbox{maximize} & b^T \nu \\
\mbox{subject to} & \|\nu\|_* \leq 1 \\
& A^T \nu = 0,
\end{array} \tag{5.64}
\]

where  we  use  the  fact  that  the  conjugate  of  a  norm  is  the  indicator  function  of  the 
dual  norm  unit  ball  (example  3.26,  page  93). 
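The dual (5.64) can be checked by hand for the Euclidean norm, which is its own dual norm. The sketch below is an illustration under assumed data (a matrix $A$ with a single column `a`, so that $x$ is a scalar); the helper names are hypothetical. It builds a dual feasible point from the least-squares residual and compares the two objective values.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def norm2(v):
    return math.sqrt(dot(v, v))

# Hypothetical data: A has the single column a, so the variable x is a scalar.
a = [1.0, 2.0, 2.0]
b = [3.0, 0.0, 3.0]

# Primal: the least-squares solution is x* = a^T b / a^T a, with residual a x* - b.
x_star = dot(a, b) / dot(a, a)
r = [ai * x_star - bi for ai, bi in zip(a, b)]
p_star = norm2(r)

# Dual candidate: nu = -r / ||r||_2 has unit norm and is orthogonal to a.
nu = [-ri / p_star for ri in r]
d_val = dot(b, nu)

print(p_star, d_val)  # both are 3 for this data, up to rounding
```

Here the normalized residual is dual feasible and attains $b^T \nu = \|Ax^\star - b\|_2$, so the duality gap is zero for this instance.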


The  idea  of  introducing  new  equality  constraints  can  be  applied  to  the  constraint 
functions  as  well.  Consider,  for  example,  the  problem 

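\[
\begin{array}{ll}
\mbox{minimize} & f_0(A_0 x + b_0) \\
\mbox{subject to} & f_i(A_i x + b_i) \leq 0, \quad i = 1, \ldots, m,
\end{array} \tag{5.65}
\]

where $A_i \in \mathbf{R}^{k_i \times n}$ and $f_i : \mathbf{R}^{k_i} \to \mathbf{R}$
are convex. (For simplicity we do not include equality constraints here.) We
introduce a new variable $y_i \in \mathbf{R}^{k_i}$, for $i = 0, \ldots, m$, and
reformulate the problem as

\[
\begin{array}{ll}
\mbox{minimize} & f_0(y_0) \\
\mbox{subject to} & f_i(y_i) \leq 0, \quad i = 1, \ldots, m \\
& A_i x + b_i = y_i, \quad i = 0, \ldots, m.
\end{array} \tag{5.66}
\]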


The  Lagrangian  for  this  problem  is 

m  m 

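\[
L(x, y_0, \ldots, y_m, \lambda, \nu_0, \ldots, \nu_m)
= f_0(y_0) + \sum_{i=1}^m \lambda_i f_i(y_i) + \sum_{i=0}^m \nu_i^T (A_i x + b_i - y_i).
\]

To find the dual function we minimize over $x$ and $y_i$. The minimum over $x$ is
$-\infty$ unless

\[
\sum_{i=0}^m A_i^T \nu_i = 0,
\]

in which case we have, for $\lambda \succ 0$,

\[
\begin{array}{rcl}
g(\lambda, \nu_0, \ldots, \nu_m)
&=& \displaystyle \sum_{i=0}^m \nu_i^T b_i + \inf_{y_0, \ldots, y_m}
    \left( f_0(y_0) + \sum_{i=1}^m \lambda_i f_i(y_i) - \sum_{i=0}^m \nu_i^T y_i \right) \\
&=& \displaystyle \sum_{i=0}^m \nu_i^T b_i + \inf_{y_0} \left( f_0(y_0) - \nu_0^T y_0 \right)
    + \sum_{i=1}^m \lambda_i \inf_{y_i} \left( f_i(y_i) - (\nu_i/\lambda_i)^T y_i \right) \\
&=& \displaystyle \sum_{i=0}^m \nu_i^T b_i - f_0^*(\nu_0) - \sum_{i=1}^m \lambda_i f_i^*(\nu_i/\lambda_i).
\end{array}
\]

The last expression involves the perspective of the conjugate function, and is
therefore concave in the dual variables. Finally, we address the question of what
happens when $\lambda \succeq 0$, but some $\lambda_i$ are zero. If $\lambda_i = 0$
and $\nu_i \neq 0$, then the dual function is $-\infty$. If $\lambda_i = 0$ and
$\nu_i = 0$, however, the terms involving $y_i$, $\nu_i$, and $\lambda_i$ are all
zero. Thus, the expression above for $g$ is valid for all $\lambda \succeq 0$, if we
take $\lambda_i f_i^*(\nu_i/\lambda_i) = 0$ when $\lambda_i = 0$ and $\nu_i = 0$, and
$\lambda_i f_i^*(\nu_i/\lambda_i) = \infty$ when $\lambda_i = 0$ and $\nu_i \neq 0$.

Therefore we can express the dual of the problem (5.66) as

\[
\begin{array}{ll}
\mbox{maximize} & \sum_{i=0}^m \nu_i^T b_i - f_0^*(\nu_0) - \sum_{i=1}^m \lambda_i f_i^*(\nu_i/\lambda_i) \\
\mbox{subject to} & \lambda \succeq 0 \\
& \sum_{i=0}^m A_i^T \nu_i = 0.
\end{array} \tag{5.67}
\]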


Example  5.7  Inequality  constrained  geometric  program.  The  inequality  constrained 
geometric  program 


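\[
\begin{array}{ll}
\mbox{minimize} & \log \left( \sum_{k=1}^{K_0} e^{a_{0k}^T x + b_{0k}} \right) \\
\mbox{subject to} & \log \left( \sum_{k=1}^{K_i} e^{a_{ik}^T x + b_{ik}} \right) \leq 0,
  \quad i = 1, \ldots, m
\end{array}
\]

is of the form (5.65) with $f_i : \mathbf{R}^{K_i} \to \mathbf{R}$ given by
$f_i(y) = \log \left( \sum_{k=1}^{K_i} e^{y_k} \right)$. The conjugate of this
function is

\[
f_i^*(\nu) = \left\{ \begin{array}{ll}
\sum_{k=1}^{K_i} \nu_k \log \nu_k & \nu \succeq 0, \;\; \mathbf{1}^T \nu = 1 \\
\infty & \mbox{otherwise.}
\end{array} \right.
\]

Using (5.67) we can immediately write down the dual problem as

\[
\begin{array}{ll}
\mbox{maximize} & b_0^T \nu_0 - \sum_{k=1}^{K_0} \nu_{0k} \log \nu_{0k}
  + \sum_{i=1}^m \left( b_i^T \nu_i - \sum_{k=1}^{K_i} \nu_{ik} \log(\nu_{ik}/\lambda_i) \right) \\
\mbox{subject to} & \nu_0 \succeq 0, \quad \mathbf{1}^T \nu_0 = 1 \\
& \nu_i \succeq 0, \quad \mathbf{1}^T \nu_i = \lambda_i, \quad i = 1, \ldots, m \\
& \lambda_i \geq 0, \quad i = 1, \ldots, m \\
& \sum_{i=0}^m A_i^T \nu_i = 0,
\end{array}
\]

which further simplifies to

\[
\begin{array}{ll}
\mbox{maximize} & b_0^T \nu_0 - \sum_{k=1}^{K_0} \nu_{0k} \log \nu_{0k}
  + \sum_{i=1}^m \left( b_i^T \nu_i - \sum_{k=1}^{K_i} \nu_{ik} \log(\nu_{ik}/\mathbf{1}^T \nu_i) \right) \\
\mbox{subject to} & \nu_i \succeq 0, \quad i = 0, \ldots, m \\
& \mathbf{1}^T \nu_0 = 1 \\
& \sum_{i=0}^m A_i^T \nu_i = 0.
\end{array}
\]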


5.7.2  Transforming  the  objective 

If we replace the objective $f_0$ by an increasing function of $f_0$, the resulting
problem is clearly equivalent (see §4.1.3). The dual of this equivalent problem,
however, can be very different from the dual of the original problem.


Example  5.8  We  consider  again  the  minimum  norm  problem 


\[
\mbox{minimize} \quad \|Ax - b\|,
\]

where $\|\cdot\|$ is some norm. We reformulate this problem as

\[
\begin{array}{ll}
\mbox{minimize} & (1/2)\|y\|^2 \\
\mbox{subject to} & Ax - b = y.
\end{array}
\]

Here we have introduced new variables, and replaced the objective by half its square.
Evidently it is equivalent to the original problem.

The dual of the reformulated problem is

\[
\begin{array}{ll}
\mbox{maximize} & -(1/2)\|\nu\|_*^2 + b^T \nu \\
\mbox{subject to} & A^T \nu = 0,
\end{array}
\]

where we use the fact that the conjugate of $(1/2)\|\cdot\|^2$ is
$(1/2)\|\cdot\|_*^2$ (see example 3.27, page 93).

Note  that  this  dual  problem  is  not  the  same  as  the  dual  problem  (5.64)  derived  earlier. 
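To see concretely how the dual in example 5.8 differs, the following sketch evaluates it for the Euclidean norm on a small assumed instance (a single-column matrix with column `a`; all names and data are hypothetical). The dual point is the unnormalized negative residual, whereas a maximizer of the norm-constrained dual of the unsquared problem must have dual norm at most one.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical data: A has the single column a, so the variable x is a scalar.
a = [1.0, 2.0, 2.0]
b = [3.0, 0.0, 3.0]

# Primal optimum of (1/2)||a x - b||_2^2.
x_star = dot(a, b) / dot(a, a)
r = [ai * x_star - bi for ai, bi in zip(a, b)]
p_val = 0.5 * dot(r, r)

# Dual point: nu = -r satisfies a^T nu = 0 but is not normalized.
nu = [-ri for ri in r]
d_val = -0.5 * dot(nu, nu) + dot(b, nu)

print(p_val, d_val)  # both are 4.5 for this data
```

The two values coincide, and the dual variable here has norm $\|Ax^\star - b\|_2$ rather than norm one, which is one way to see that the two dual problems are genuinely different.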


5.7.3  Implicit  constraints 

The  next  simple  reformulation  we  study  is  to  include  some  of  the  constraints  in 
the  objective  function,  by  modifying  the  objective  function  to  be  infinite  when  the 
constraint  is  violated. 


Example  5.9  Linear  program  with  box  constraints.  We  consider  the  linear  program 

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Ax = b \\
& l \preceq x \preceq u
\end{array} \tag{5.68}
\]

where $A \in \mathbf{R}^{p \times n}$ and $l \prec u$. The constraints
$l \preceq x \preceq u$ are sometimes called box constraints or variable bounds.

We can, of course, derive the dual of this linear program. The dual will have a
Lagrange multiplier $\nu$ associated with the equality constraint, $\lambda_1$
associated with the inequality constraint $x \preceq u$, and $\lambda_2$ associated
with the inequality constraint $l \preceq x$. The dual is

\[
\begin{array}{ll}
\mbox{maximize} & -b^T \nu - \lambda_1^T u + \lambda_2^T l \\
\mbox{subject to} & A^T \nu + \lambda_1 - \lambda_2 + c = 0 \\
& \lambda_1 \succeq 0, \quad \lambda_2 \succeq 0.
\end{array} \tag{5.69}
\]

Instead, let us first reformulate the problem (5.68) as

\[
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & Ax = b,
\end{array} \tag{5.70}
\]

where we define

\[
f_0(x) = \left\{ \begin{array}{ll}
c^T x & l \preceq x \preceq u \\
\infty & \mbox{otherwise.}
\end{array} \right.
\]

The problem (5.70) is clearly equivalent to (5.68); we have merely made the explicit
box constraints implicit.

The dual function for the problem (5.70) is

\[
\begin{array}{rcl}
g(\nu) &=& \displaystyle \inf_{l \preceq x \preceq u} \left( c^T x + \nu^T (Ax - b) \right) \\
&=& -b^T \nu - u^T (A^T \nu + c)^- + l^T (A^T \nu + c)^+
\end{array}
\]

where $y_i^+ = \max\{y_i, 0\}$, $y_i^- = \max\{-y_i, 0\}$. So here we are able to
derive an analytical formula for $g$, which is a concave piecewise-linear function.

The dual problem is the unconstrained problem

\[
\mbox{maximize} \quad -b^T \nu - u^T (A^T \nu + c)^- + l^T (A^T \nu + c)^+, \tag{5.71}
\]

which has a quite different form from the dual of the original problem.

(The problems (5.69) and (5.71) are closely related, in fact, equivalent; see
exercise 5.8.)
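The closed-form dual function of example 5.9 can be sanity-checked numerically. The sketch below is illustrative only: the function names and the small data set are assumptions. It compares the analytical formula for $g(\nu)$ with a brute-force minimization over the vertices of the box, which is valid because a linear function on a box attains its minimum at a vertex.

```python
from itertools import product

def dual_box_lp(nu, A, b, c, l, u):
    """g(nu) = -b^T nu - u^T (A^T nu + c)^- + l^T (A^T nu + c)^+."""
    n = len(c)
    w = [c[j] + sum(A[i][j] * nu[i] for i in range(len(nu))) for j in range(n)]
    val = -sum(bi * vi for bi, vi in zip(b, nu))
    for j in range(n):
        val += l[j] * max(w[j], 0.0) - u[j] * max(-w[j], 0.0)
    return val

def dual_by_enumeration(nu, A, b, c, l, u):
    """The same infimum, computed by trying every vertex of l <= x <= u."""
    best = float("inf")
    for x in product(*zip(l, u)):
        resid = [sum(A[i][j] * x[j] for j in range(len(x))) - b[i]
                 for i in range(len(b))]
        val = sum(cj * xj for cj, xj in zip(c, x))
        val += sum(vi * ri for vi, ri in zip(nu, resid))
        best = min(best, val)
    return best

# Hypothetical data: one equality constraint x1 + x2 + x3 = 1 on the unit box.
A = [[1.0, 1.0, 1.0]]
b = [1.0]
c = [1.0, -2.0, 0.5]
l = [0.0, 0.0, 0.0]
u = [1.0, 1.0, 1.0]

x_feas = [0.0, 1.0, 0.0]                      # primal feasible point
primal_val = sum(cj * xj for cj, xj in zip(c, x_feas))
print(primal_val, dual_box_lp([2.0], A, b, c, l, u))
```

For this data the feasible point and the dual point $\nu = 2$ give the same value, so both are optimal; for any other $\nu$ the formula returns a lower bound on the primal optimal value.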


5.8  Theorems  of  alternatives 

5.8.1  Weak  alternatives  via  the  dual  function 

In  this  section  we  apply  Lagrange  duality  theory  to  the  problem  of  determining 
feasibility  of  a  system  of  inequalities  and  equalities 

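\[
f_i(x) \leq 0, \quad i = 1, \ldots, m, \qquad h_i(x) = 0, \quad i = 1, \ldots, p. \tag{5.72}
\]

We assume the domain of the inequality system (5.72),
$\mathcal{D} = \bigcap_{i=1}^m \operatorname{dom} f_i \cap \bigcap_{i=1}^p \operatorname{dom} h_i$,
is nonempty. We can think of (5.72) as the standard problem (5.1), with objective
$f_0 = 0$, i.e.,

\[
\begin{array}{ll}
\mbox{minimize} & 0 \\
\mbox{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p.
\end{array} \tag{5.73}
\]

This problem has optimal value

\[
p^\star = \left\{ \begin{array}{ll}
0 & \mbox{(5.72) is feasible} \\
\infty & \mbox{(5.72) is infeasible,}
\end{array} \right. \tag{5.74}
\]

so solving the optimization problem (5.73) is the same as solving the inequality
system (5.72).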


The  dual  function 

We associate with the inequality system (5.72) the dual function

\[
g(\lambda, \nu) = \inf_{x \in \mathcal{D}}
\left( \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \right),
\]

which is the same as the dual function for the optimization problem (5.73). Since
$f_0 = 0$, the dual function is positive homogeneous in $(\lambda, \nu)$: For
$\alpha > 0$, $g(\alpha\lambda, \alpha\nu) = \alpha g(\lambda, \nu)$. The dual
problem associated with (5.73) is to maximize $g(\lambda, \nu)$ subject to
$\lambda \succeq 0$. Since $g$ is homogeneous, the optimal value of this dual problem
is given by

\[
d^\star = \left\{ \begin{array}{ll}
\infty & \lambda \succeq 0, \;\; g(\lambda, \nu) > 0 \mbox{ is feasible} \\
0 & \lambda \succeq 0, \;\; g(\lambda, \nu) > 0 \mbox{ is infeasible.}
\end{array} \right. \tag{5.75}
\]

Weak duality tells us that $d^\star \leq p^\star$. Combining this fact with (5.74)
and (5.75) yields the following: If the inequality system

\[
\lambda \succeq 0, \qquad g(\lambda, \nu) > 0 \tag{5.76}
\]


is feasible (which means $d^\star = \infty$), then the inequality system (5.72) is
infeasible (since we then have $p^\star = \infty$). Indeed, we can interpret any
solution $(\lambda, \nu)$ of the inequalities (5.76) as a proof or certificate of
infeasibility of the system (5.72).

We  can  restate  this  implication  in  terms  of  feasibility  of  the  original  system:  If 
the  original  inequality  system  (5.72)  is  feasible,  then  the  inequality  system  (5.76) 
must  be  infeasible.  We  can  interpret  an  x  which  satisfies  (5.72)  as  a  certificate 
establishing  infeasibility  of  the  inequality  system  (5.76). 

Two systems of inequalities (and equalities) are called weak alternatives if at
most one of the two is feasible. Thus, the systems (5.72) and (5.76) are weak
alternatives. This is true whether or not the inequalities (5.72) are convex (i.e.,
$f_i$ convex, $h_i$ affine); moreover, the alternative inequality system (5.76) is
always convex (i.e., $g$ is concave and the constraints $\lambda_i \geq 0$ are
convex).


Strict  inequalities 

We  can  also  study  feasibility  of  the  strict  inequality  system 

\[
f_i(x) < 0, \quad i = 1, \ldots, m, \qquad h_i(x) = 0, \quad i = 1, \ldots, p. \tag{5.77}
\]

With $g$ defined as for the nonstrict inequality system, we have the alternative
inequality system

\[
\lambda \succeq 0, \qquad \lambda \neq 0, \qquad g(\lambda, \nu) \geq 0. \tag{5.78}
\]

We can show directly that (5.77) and (5.78) are weak alternatives. Suppose there
exists an $\tilde x$ with $f_i(\tilde x) < 0$, $h_i(\tilde x) = 0$. Then for any
$\lambda \succeq 0$, $\lambda \neq 0$, and $\nu$,

\[
\lambda_1 f_1(\tilde x) + \cdots + \lambda_m f_m(\tilde x)
+ \nu_1 h_1(\tilde x) + \cdots + \nu_p h_p(\tilde x) < 0.
\]

It follows that

\[
\begin{array}{rcl}
g(\lambda, \nu) &=& \displaystyle \inf_{x \in \mathcal{D}}
  \left( \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \right) \\
&\leq& \displaystyle \sum_{i=1}^m \lambda_i f_i(\tilde x) + \sum_{i=1}^p \nu_i h_i(\tilde x) \\
&<& 0.
\end{array}
\]

Therefore, feasibility of (5.77) implies that there does not exist $(\lambda, \nu)$
satisfying (5.78).

Thus, we can prove infeasibility of (5.77) by producing a solution of the
system (5.78); we can prove infeasibility of (5.78) by producing a solution of the
system (5.77).


5.8.2  Strong  alternatives 

When the original inequality system is convex, i.e., $f_i$ are convex and $h_i$ are
affine, and some type of constraint qualification holds, then the pairs of weak
alternatives described above are strong alternatives, which means that exactly one
of the two alternatives holds. In other words, each of the inequality systems is
feasible if and only if the other is infeasible.

In this section we assume that $f_i$ are convex and $h_i$ are affine, so the
inequality system (5.72) can be expressed as

\[
f_i(x) \leq 0, \quad i = 1, \ldots, m, \qquad Ax = b,
\]

where $A \in \mathbf{R}^{p \times n}$.

Strict  inequalities 

We  first  study  the  strict  inequality  system 

\[
f_i(x) < 0, \quad i = 1, \ldots, m, \qquad Ax = b, \tag{5.79}
\]

and its alternative

\[
\lambda \succeq 0, \qquad \lambda \neq 0, \qquad g(\lambda, \nu) \geq 0. \tag{5.80}
\]

We need one technical condition: There exists an $x \in \operatorname{relint} \mathcal{D}$
with $Ax = b$. In other words we not only assume that the linear equality constraints
are consistent, but also that they have a solution in $\operatorname{relint} \mathcal{D}$.
(Very often $\mathcal{D} = \mathbf{R}^n$, so the condition is satisfied if the equality
constraints are consistent.) Under this condition, exactly one of the inequality
systems (5.79) and (5.80) is feasible. In other words, the inequality systems (5.79)
and (5.80) are strong alternatives.

We  will  establish  this  result  by  considering  the  related  optimization  problem 

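\[
\begin{array}{ll}
\mbox{minimize} & s \\
\mbox{subject to} & f_i(x) - s \leq 0, \quad i = 1, \ldots, m \\
& Ax = b
\end{array} \tag{5.81}
\]

with variables $x$, $s$, and domain $\mathcal{D} \times \mathbf{R}$. The optimal
value $p^\star$ of this problem is negative if and only if there exists a solution to
the strict inequality system (5.79). The Lagrange dual function for the problem (5.81)
is

\[
\inf_{x \in \mathcal{D}, \; s} \left( s + \sum_{i=1}^m \lambda_i (f_i(x) - s) + \nu^T (Ax - b) \right)
= \left\{ \begin{array}{ll}
g(\lambda, \nu) & \mathbf{1}^T \lambda = 1 \\
-\infty & \mbox{otherwise.}
\end{array} \right.
\]

Therefore we can express the dual problem of (5.81) as

\[
\begin{array}{ll}
\mbox{maximize} & g(\lambda, \nu) \\
\mbox{subject to} & \lambda \succeq 0, \quad \mathbf{1}^T \lambda = 1.
\end{array}
\]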

Now we observe that Slater’s condition holds for the problem (5.81). By the
hypothesis there exists an $\tilde x \in \operatorname{relint} \mathcal{D}$ with
$A\tilde x = b$. Choosing any $\tilde s > \max_i f_i(\tilde x)$ yields a point
$(\tilde x, \tilde s)$ which is strictly feasible for (5.81). Therefore we have
$d^\star = p^\star$, and the dual optimum $d^\star$ is attained. In other words,
there exist $(\lambda^\star, \nu^\star)$ such that

\[
g(\lambda^\star, \nu^\star) = p^\star, \qquad \lambda^\star \succeq 0, \qquad
\mathbf{1}^T \lambda^\star = 1. \tag{5.82}
\]

Now suppose that the strict inequality system (5.79) is infeasible, which means that
$p^\star \geq 0$. Then $(\lambda^\star, \nu^\star)$ from (5.82) satisfy the alternate
inequality system (5.80).

Similarly, if the alternate inequality system (5.80) is feasible, then
$d^\star = p^\star \geq 0$, which shows that the strict inequality system (5.79) is
infeasible. Thus, the inequality systems (5.79) and (5.80) are strong alternatives;
each is feasible if and only if the other is not.

Nonstrict  inequalities 

We  now  consider  the  nonstrict  inequality  system 

\[
f_i(x) \leq 0, \quad i = 1, \ldots, m, \qquad Ax = b, \tag{5.83}
\]

and its alternative

\[
\lambda \succeq 0, \qquad g(\lambda, \nu) > 0. \tag{5.84}
\]

We will show these are strong alternatives, provided the following conditions hold:
There exists an $x \in \operatorname{relint} \mathcal{D}$ with $Ax = b$, and the
optimal value $p^\star$ of (5.81) is attained. This holds, for example, if
$\mathcal{D} = \mathbf{R}^n$ and $\max_i f_i(x) \to \infty$ as $x \to \infty$.
With these assumptions we have, as in the strict case, that $p^\star = d^\star$, and
that both the primal and dual optimal values are attained. Now suppose that the
nonstrict inequality system (5.83) is infeasible, which means that $p^\star > 0$.
(Here we use the assumption that the primal optimal value is attained.) Then
$(\lambda^\star, \nu^\star)$ from (5.82) satisfy the alternate inequality
system (5.84). Thus, the inequality systems (5.83) and (5.84) are strong
alternatives; each is feasible if and only if the other is not.


5.8.3  Examples 


Linear  inequalities 

Consider the system of linear inequalities $Ax \preceq b$. The dual function is

\[
g(\lambda) = \inf_x \lambda^T (Ax - b) = \left\{ \begin{array}{ll}
-b^T \lambda & A^T \lambda = 0 \\
-\infty & \mbox{otherwise.}
\end{array} \right.
\]

The alternative inequality system is therefore

\[
\lambda \succeq 0, \qquad A^T \lambda = 0, \qquad b^T \lambda < 0.
\]

These are, in fact, strong alternatives. This follows since the optimum in the related
problem (5.81) is achieved, unless it is unbounded below.

We now consider the system of strict linear inequalities $Ax \prec b$, which has the
strong alternative system

\[
\lambda \succeq 0, \qquad \lambda \neq 0, \qquad A^T \lambda = 0, \qquad b^T \lambda \leq 0.
\]

In  fact  we  have  encountered  (and  proved)  this  result  before,  in  §2.5.1;  see  (2.17) 
and  (2.18)  (on  page  50). 
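A certificate of infeasibility for $Ax \preceq b$ is easy to verify mechanically. The following sketch is an illustration; the function name `certifies_infeasibility` and the example data are assumptions. It checks the three conditions $\lambda \succeq 0$, $A^T \lambda = 0$, $b^T \lambda < 0$ for a given $\lambda$.

```python
def certifies_infeasibility(A, b, lam, tol=1e-12):
    """Return True if lam >= 0, A^T lam = 0 and b^T lam < 0 (up to tol)."""
    m, n = len(A), len(A[0])
    nonneg = all(li >= -tol for li in lam)
    At_lam = [sum(A[i][j] * lam[i] for i in range(m)) for j in range(n)]
    orthogonal = all(abs(v) <= tol for v in At_lam)
    return nonneg and orthogonal and sum(bi * li for bi, li in zip(b, lam)) < 0

# Hypothetical infeasible system: x <= -1 and -x <= -1 (that is, x >= 1).
A = [[1.0], [-1.0]]
b_infeasible = [-1.0, -1.0]
print(certifies_infeasibility(A, b_infeasible, [1.0, 1.0]))  # True

# A feasible system, -1 <= x <= 1, admits no certificate; this lam fails the test.
b_feasible = [1.0, 1.0]
print(certifies_infeasibility(A, b_feasible, [1.0, 1.0]))    # False
```

The check works because any $x$ with $Ax \preceq b$ would give $\lambda^T (Ax - b) \leq 0$, while the certificate conditions force $\lambda^T (Ax - b) = -b^T \lambda > 0$.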

Intersection  of  ellipsoids 

We  consider  m  ellipsoids,  described  as 

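\[
\mathcal{E}_i = \{ x \mid f_i(x) \leq 0 \},
\]

with $f_i(x) = x^T A_i x + 2 b_i^T x + c_i$, $i = 1, \ldots, m$, where
$A_i \in \mathbf{S}^n_{++}$. We ask when the intersection of these ellipsoids has
nonempty interior. This is equivalent to feasibility of the set of strict quadratic
inequalities

\[
f_i(x) = x^T A_i x + 2 b_i^T x + c_i < 0, \quad i = 1, \ldots, m. \tag{5.85}
\]

The dual function $g$ is

\[
\begin{array}{rcl}
g(\lambda) &=& \displaystyle \inf_x \left( x^T A(\lambda) x + 2 b(\lambda)^T x + c(\lambda) \right) \\
&=& \left\{ \begin{array}{ll}
-b(\lambda)^T A(\lambda)^\dagger b(\lambda) + c(\lambda) &
  A(\lambda) \succeq 0, \;\; b(\lambda) \in \mathcal{R}(A(\lambda)) \\
-\infty & \mbox{otherwise,}
\end{array} \right.
\end{array}
\]

where

\[
A(\lambda) = \sum_{i=1}^m \lambda_i A_i, \qquad
b(\lambda) = \sum_{i=1}^m \lambda_i b_i, \qquad
c(\lambda) = \sum_{i=1}^m \lambda_i c_i.
\]

Note that for $\lambda \succeq 0$, $\lambda \neq 0$, we have $A(\lambda) \succ 0$, so
we can simplify the expression for the dual function as

\[
g(\lambda) = -b(\lambda)^T A(\lambda)^{-1} b(\lambda) + c(\lambda).
\]

The strong alternative of the system (5.85) is therefore

\[
\lambda \succeq 0, \qquad \lambda \neq 0, \qquad
-b(\lambda)^T A(\lambda)^{-1} b(\lambda) + c(\lambda) \geq 0. \tag{5.86}
\]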

We  can  give  a  simple  geometric  interpretation  of  this  pair  of  strong  alternatives. 
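For any nonzero $\lambda \succeq 0$, the (possibly empty) ellipsoid

\[
\mathcal{E}_\lambda = \{ x \mid x^T A(\lambda) x + 2 b(\lambda)^T x + c(\lambda) \leq 0 \}
\]

contains $\mathcal{E}_1 \cap \cdots \cap \mathcal{E}_m$, since $f_i(x) \leq 0$ implies
$\sum_{i=1}^m \lambda_i f_i(x) \leq 0$. Now, $\mathcal{E}_\lambda$ has empty interior
if and only if

\[
\inf_x \left( x^T A(\lambda) x + 2 b(\lambda)^T x + c(\lambda) \right)
= -b(\lambda)^T A(\lambda)^{-1} b(\lambda) + c(\lambda) \geq 0.
\]

Therefore the alternative system (5.86) means that $\mathcal{E}_\lambda$ has empty
interior.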
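A one-dimensional instance makes the certificate tangible. In the sketch below (an illustration with assumed data; each "ellipsoid" is an interval described by a hypothetical triple `(A_i, b_i, c_i)` of scalars), two touching intervals admit a multiplier with $g(\lambda) \geq 0$, while two overlapping intervals do not.

```python
def g(lam, ellipsoids):
    """Dual function -b(lam)^2 / A(lam) + c(lam) for scalar x (n = 1).

    Assumes lam is nonzero and nonnegative and every A_i > 0, so A(lam) > 0.
    """
    A = sum(l * e[0] for l, e in zip(lam, ellipsoids))
    b = sum(l * e[1] for l, e in zip(lam, ellipsoids))
    c = sum(l * e[2] for l, e in zip(lam, ellipsoids))
    return -b * b / A + c

# x^2 + 2x <= 0 is [-2, 0] and x^2 - 2x <= 0 is [0, 2]: they meet only at x = 0.
touching = [(1.0, 1.0, 0.0), (1.0, -1.0, 0.0)]
print(g((1.0, 1.0), touching))  # 0.0, so lam = (1, 1) satisfies the alternative

# x^2 + 2x <= 0 is [-2, 0] and x^2 - x - 0.75 <= 0 is [-0.5, 1.5]: they overlap.
overlapping = [(1.0, 1.0, 0.0), (1.0, -0.5, -0.75)]
# g is positive homogeneous, so scanning lam with lam_1 + lam_2 = 1 is enough.
values = [g((i / 10, 1 - i / 10), overlapping) for i in range(11)]
print(max(values))  # negative on this grid: no certificate among these points
```

For the touching pair, $\lambda = (1, 1)$ gives the combined set $\{x \mid 2x^2 \leq 0\} = \{0\}$, which contains the intersection and has empty interior.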

Weak duality is obvious: If (5.86) holds, then $\mathcal{E}_\lambda$ contains the
intersection $\mathcal{E}_1 \cap \cdots \cap \mathcal{E}_m$, and has empty interior,
so naturally the intersection has empty interior. The fact that these are strong
alternatives states the (not obvious) fact that if the intersection
$\mathcal{E}_1 \cap \cdots \cap \mathcal{E}_m$ has empty interior, then we can
construct an ellipsoid $\mathcal{E}_\lambda$ that contains the intersection and has
empty interior.

Farkas’ lemma

In this section we describe a pair of strong alternatives for a mixture of strict and
nonstrict linear inequalities, known as Farkas’ lemma: The system of inequalities

\[
Ax \preceq 0, \qquad c^T x < 0, \tag{5.87}
\]

where $A \in \mathbf{R}^{m \times n}$ and $c \in \mathbf{R}^n$, and the system of
equalities and inequalities

\[
A^T y + c = 0, \qquad y \succeq 0, \tag{5.88}
\]

are strong alternatives.

We can prove Farkas’ lemma directly, using LP duality. Consider the LP

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Ax \preceq 0,
\end{array} \tag{5.89}
\]

and its dual

\[
\begin{array}{ll}
\mbox{maximize} & 0 \\
\mbox{subject to} & A^T y + c = 0 \\
& y \succeq 0.
\end{array} \tag{5.90}
\]

The primal LP (5.89) is homogeneous, and so has optimal value $0$, if (5.87) is
not feasible, and optimal value $-\infty$, if (5.87) is feasible. The dual LP (5.90)
has optimal value $0$, if (5.88) is feasible, and optimal value $-\infty$, if (5.88)
is infeasible.

Since $x = 0$ is feasible in (5.89), we can rule out the one case in which strong
duality can fail for LPs, so we must have $p^\star = d^\star$. Combined with the
remarks above, this shows that (5.87) and (5.88) are strong alternatives.


Example 5.10 Arbitrage-free bounds on price. We consider a set of $n$ assets, with
prices at the beginning of an investment period $p_1, \ldots, p_n$, respectively. At
the end of the investment period, the value of the assets is $v_1, \ldots, v_n$. If
$x_1, \ldots, x_n$ represents the initial investment in each asset (with $x_j < 0$
meaning a short position in asset $j$), the cost of the initial investment is
$p^T x$, and the final value of the investment is $v^T x$.

The value of the assets at the end of the investment period, $v$, is uncertain. We
will assume that only $m$ possible scenarios, or outcomes, are possible. If outcome
$i$ occurs, the final value of the assets is $v^{(i)}$, and therefore, the overall
value of the investments is $v^{(i)T} x$.

If there is an investment vector $x$ with $p^T x < 0$, and in all possible scenarios,
the final value is nonnegative, i.e., $v^{(i)T} x \geq 0$ for $i = 1, \ldots, m$, then
an arbitrage is said to exist. The condition $p^T x < 0$ means you are paid to accept
the investment mix, and the condition $v^{(i)T} x \geq 0$ for $i = 1, \ldots, m$ means
that no matter what outcome occurs, the final value is nonnegative, so an arbitrage
corresponds to a guaranteed money-making investment strategy. It is generally assumed
that the prices and values are such that no arbitrage exists. This means that the
inequality system

\[
Vx \succeq 0, \qquad p^T x < 0
\]

is infeasible, where $V_{ij} = v_j^{(i)}$.

Using Farkas’ lemma, we have no arbitrage if and only if there exists $y$ such that

\[
-V^T y + p = 0, \qquad y \succeq 0.
\]




We  can  use  this  characterization  of  arbitrage-free  prices  and  values  to  solve  several 
interesting  problems. 

Suppose, for example, that the values $V$ are known, and all prices except the last
one, $p_n$, are known. The set of prices $p_n$ that are consistent with the
no-arbitrage assumption is an interval, which can be found by solving a pair of LPs.
The optimal value of the LP

\[
\begin{array}{ll}
\mbox{minimize} & p_n \\
\mbox{subject to} & V^T y = p, \quad y \succeq 0,
\end{array}
\]

with variables $p_n$ and $y$, gives the smallest possible arbitrage-free price for
asset $n$. Solving the same LP with maximization instead of minimization yields the
largest possible price for asset $n$. If the two values are equal, i.e., the
no-arbitrage assumption leads us to a unique price for asset $n$, we say the market
is complete. For an example, see exercise 5.38.

This  method  can  be  used  to  find  bounds  on  the  price  of  a  derivative  or  option  that 
is  based  on  the  final  value  of  other  underlying  assets,  i.e.,  when  the  value  or  payoff 
of  asset  n  is  a  function  of  the  values  of  the  other  assets. 
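In one special case the pair of LPs above can be solved by inspection, which gives a quick way to experiment with the idea. The sketch below assumes (this is not in the text) a market with just two assets: a riskless bond that is worth 1 in every scenario and has known price `bond_price`, and a risky asset with scenario values `payoffs`. The function name `price_bounds` is hypothetical.

```python
def price_bounds(bond_price, payoffs):
    """Smallest and largest arbitrage-free price of the risky asset.

    With V having rows (1, payoffs[i]), the constraint V^T y = p reads
    sum(y) = bond_price and sum(y_i * payoffs[i]) = p_n, with y >= 0.
    """
    # The set {y >= 0, sum(y) = bond_price} has vertices bond_price * e_i, and a
    # linear objective over this set attains its extremes at vertices.
    candidates = [bond_price * v for v in payoffs]
    return min(candidates), max(candidates)

# Hypothetical data: bond priced at 0.9, risky asset worth 2.0, 0.5 or 1.0.
low, high = price_bounds(0.9, [2.0, 0.5, 1.0])
print(low, high)

# With a single scenario the interval collapses to a point (a complete market).
print(price_bounds(0.9, [2.0]))
```

For general $V$ and several known prices the interval has no such closed form and the two LPs have to be solved numerically.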


5.9  Generalized  inequalities 

In this section we examine how Lagrange duality extends to a problem with
generalized inequality constraints

\[
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \preceq_{K_i} 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p,
\end{array} \tag{5.91}
\]

where $K_i \subseteq \mathbf{R}^{k_i}$ are proper cones. For now, we do not assume
convexity of the problem (5.91). We assume the domain of (5.91),
$\mathcal{D} = \bigcap_{i=0}^m \operatorname{dom} f_i \cap \bigcap_{i=1}^p \operatorname{dom} h_i$,
is nonempty.


5.9.1  The  Lagrange  dual 

With each generalized inequality $f_i(x) \preceq_{K_i} 0$ in (5.91) we associate a
Lagrange multiplier vector $\lambda_i \in \mathbf{R}^{k_i}$ and define the associated
Lagrangian as

\[
L(x, \lambda, \nu) = f_0(x) + \lambda_1^T f_1(x) + \cdots + \lambda_m^T f_m(x)
+ \nu_1 h_1(x) + \cdots + \nu_p h_p(x),
\]

where $\lambda = (\lambda_1, \ldots, \lambda_m)$ and $\nu = (\nu_1, \ldots, \nu_p)$.
The dual function is defined exactly as in a problem with scalar inequalities:

\[
g(\lambda, \nu) = \inf_{x \in \mathcal{D}} L(x, \lambda, \nu)
= \inf_{x \in \mathcal{D}} \left( f_0(x) + \sum_{i=1}^m \lambda_i^T f_i(x)
+ \sum_{i=1}^p \nu_i h_i(x) \right).
\]

Since the Lagrangian is affine in the dual variables $(\lambda, \nu)$, and the dual
function is a pointwise infimum of the Lagrangian, the dual function is concave.

As in a problem with scalar inequalities, the dual function gives lower bounds
on $p^\star$, the optimal value of the primal problem (5.91). For a problem with
scalar inequalities, we require $\lambda_i \geq 0$. Here the nonnegativity requirement
on the dual variables is replaced by the condition

\[
\lambda_i \succeq_{K_i^*} 0, \quad i = 1, \ldots, m,
\]

where $K_i^*$ denotes the dual cone of $K_i$. In other words, the Lagrange multipliers
associated with inequalities must be dual nonnegative.

Weak duality follows immediately from the definition of dual cone. If
$\lambda_i \succeq_{K_i^*} 0$ and $f_i(x) \preceq_{K_i} 0$, then
$\lambda_i^T f_i(x) \leq 0$. Therefore for any primal feasible point $x$ and any
$\lambda_i \succeq_{K_i^*} 0$, we have

\[
f_0(x) + \sum_{i=1}^m \lambda_i^T f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \leq f_0(x).
\]

Taking the infimum over $x$ yields $g(\lambda, \nu) \leq p^\star$.

The Lagrange dual optimization problem is

\[
\begin{array}{ll}
\mbox{maximize} & g(\lambda, \nu) \\
\mbox{subject to} & \lambda_i \succeq_{K_i^*} 0, \quad i = 1, \ldots, m.
\end{array} \tag{5.92}
\]

We always have weak duality, i.e., $d^\star \leq p^\star$, where $d^\star$ denotes the
optimal value of the dual problem (5.92), whether or not the primal problem (5.91) is
convex.

Slater’s  condition  and  strong  duality 

As might be expected, strong duality ($d^\star = p^\star$) holds when the primal
problem is convex and satisfies an appropriate constraint qualification. For example,
a generalized version of Slater’s condition for the problem

\[
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \preceq_{K_i} 0, \quad i = 1, \ldots, m \\
& Ax = b,
\end{array}
\]

where $f_0$ is convex and $f_i$ is $K_i$-convex, is that there exists an
$x \in \operatorname{relint} \mathcal{D}$ with $Ax = b$ and $f_i(x) \prec_{K_i} 0$,
$i = 1, \ldots, m$. This condition implies strong duality (and also, that the dual
optimum is attained).


Example  5.11  Lagrange  dual  of  semidefinite  program.  We  consider  a  semidefinite 
program  in  inequality  form, 

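\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & x_1 F_1 + \cdots + x_n F_n + G \preceq 0
\end{array} \tag{5.93}
\]

where $F_1, \ldots, F_n, G \in \mathbf{S}^k$. (Here $f_1$ is affine, and $K_1$ is
$\mathbf{S}^k_+$, the positive semidefinite cone.)

We associate with the constraint a dual variable or multiplier $Z \in \mathbf{S}^k$,
so the Lagrangian is

\[
\begin{array}{rcl}
L(x, Z) &=& c^T x + \mathop{\bf tr} \left( (x_1 F_1 + \cdots + x_n F_n + G) Z \right) \\
&=& x_1 (c_1 + \mathop{\bf tr}(F_1 Z)) + \cdots + x_n (c_n + \mathop{\bf tr}(F_n Z))
  + \mathop{\bf tr}(GZ),
\end{array}
\]

which is affine in $x$. The dual function is given by

\[
g(Z) = \inf_x L(x, Z) = \left\{ \begin{array}{ll}
\mathop{\bf tr}(GZ) & \mathop{\bf tr}(F_i Z) + c_i = 0, \quad i = 1, \ldots, n \\
-\infty & \mbox{otherwise.}
\end{array} \right.
\]

The dual problem can therefore be expressed as

\[
\begin{array}{ll}
\mbox{maximize} & \mathop{\bf tr}(GZ) \\
\mbox{subject to} & \mathop{\bf tr}(F_i Z) + c_i = 0, \quad i = 1, \ldots, n \\
& Z \succeq 0.
\end{array}
\]

(We use the fact that $\mathbf{S}^k_+$ is self-dual, i.e.,
$(\mathbf{S}^k_+)^* = \mathbf{S}^k_+$; see §2.6.)

Strong duality obtains if the semidefinite program (5.93) is strictly feasible, i.e.,
there exists an $x$ with

\[
x_1 F_1 + \cdots + x_n F_n + G \prec 0.
\]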


Example  5.12  Lagrange  dual  of  cone  program  in  standard  form.  We  consider  the 
cone  program 

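\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Ax = b \\
& x \succeq_K 0,
\end{array}
\]

where $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$, and
$K \subseteq \mathbf{R}^n$ is a proper cone. We associate with the equality
constraint a multiplier $\nu \in \mathbf{R}^m$, and with the nonnegativity constraint
a multiplier $\lambda \in \mathbf{R}^n$. The Lagrangian is

\[
L(x, \lambda, \nu) = c^T x - \lambda^T x + \nu^T (Ax - b),
\]

so the dual function is

\[
g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) = \left\{ \begin{array}{ll}
-b^T \nu & A^T \nu - \lambda + c = 0 \\
-\infty & \mbox{otherwise.}
\end{array} \right.
\]

The dual problem can be expressed as

\[
\begin{array}{ll}
\mbox{maximize} & -b^T \nu \\
\mbox{subject to} & A^T \nu + c = \lambda \\
& \lambda \succeq_{K^*} 0.
\end{array}
\]

By eliminating $\lambda$ and defining $y = -\nu$, this problem can be simplified to

\[
\begin{array}{ll}
\mbox{maximize} & b^T y \\
\mbox{subject to} & A^T y \preceq_{K^*} c,
\end{array}
\]

which is a cone program in inequality form, involving the dual generalized inequality.

Strong duality obtains if the Slater condition holds, i.e., there is an
$x \succ_K 0$ with $Ax = b$.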
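The primal and dual cone programs of example 5.12 can be tried out on a very small instance. The sketch below is an assumption-laden illustration: it takes $K = \{(x_1, t) \mid |x_1| \leq t\}$, the second-order cone in $\mathbf{R}^2$, which is self-dual, and the problem of minimizing $t$ subject to $x_1 = 1$. All function names are hypothetical.

```python
def in_cone(z, tol=1e-12):
    """Membership in K = {(x1, t) : |x1| <= t}; this cone equals its dual."""
    return abs(z[0]) <= z[1] + tol

# Hypothetical data: minimize t subject to x1 = 1 and (x1, t) in K.
c = [0.0, 1.0]
A = [[1.0, 0.0]]
b = [1.0]

def primal_objective(x):
    """Return c^T x if Ax = b and x is in K, otherwise None."""
    residual = A[0][0] * x[0] + A[0][1] * x[1] - b[0]
    if abs(residual) <= 1e-12 and in_cone(x):
        return c[0] * x[0] + c[1] * x[1]
    return None

def dual_objective(y):
    """Return b^T y if c - A^T y lies in K* (= K here), otherwise None."""
    slack = [c[j] - A[0][j] * y[0] for j in range(2)]
    return b[0] * y[0] if in_cone(slack) else None

print(primal_objective([1.0, 1.0]), dual_objective([1.0]))  # 1.0 1.0
print(primal_objective([1.0, 0.5]), dual_objective([1.5]))  # None None
```

Every dual feasible $y$ (here $|y| \leq 1$) gives a lower bound on every primal feasible value $t \geq 1$, and the bound is tight at $y = 1$, $t = 1$. The point $(1, 2)$ is strictly feasible, consistent with the Slater condition.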


5.9.2  Optimality  conditions 

The optimality conditions of §5.5 are readily extended to problems with generalized
inequalities. We first derive the complementary slackness conditions.

Complementary slackness

Assume that the primal and dual optimal values are equal, and attained at the
optimal points $x^\star$, $\lambda^\star$, $\nu^\star$. As in §5.5.2, the
complementary slackness conditions follow directly from the equality
$f_0(x^\star) = g(\lambda^\star, \nu^\star)$, along with the definition of $g$.
We have

\[
\begin{array}{rcl}
f_0(x^\star) &=& g(\lambda^\star, \nu^\star) \\
&\leq& \displaystyle f_0(x^\star) + \sum_{i=1}^m \lambda_i^{\star T} f_i(x^\star)
  + \sum_{i=1}^p \nu_i^\star h_i(x^\star) \\
&\leq& f_0(x^\star),
\end{array}
\]

and therefore we conclude that $x^\star$ minimizes $L(x, \lambda^\star, \nu^\star)$,
and also that the two sums in the second line are zero. Since the second sum is zero
(since $x^\star$ satisfies the equality constraints), we have
$\sum_{i=1}^m \lambda_i^{\star T} f_i(x^\star) = 0$. Since each term in this sum is
nonpositive, we conclude that

\[
\lambda_i^{\star T} f_i(x^\star) = 0, \quad i = 1, \ldots, m, \tag{5.94}
\]

which generalizes the complementary slackness condition (5.48). From (5.94) we
can conclude that

\[
\lambda_i^\star \succ_{K_i^*} 0 \;\Longrightarrow\; f_i(x^\star) = 0, \qquad
f_i(x^\star) \prec_{K_i} 0 \;\Longrightarrow\; \lambda_i^\star = 0.
\]

However, in contrast to problems with scalar inequalities, it is possible to
satisfy (5.94) with $\lambda_i^\star \neq 0$ and $f_i(x^\star) \neq 0$.


KKT  conditions 

Now we add the assumption that the functions $f_i$, $h_i$ are differentiable, and
generalize the KKT conditions of §5.5.3 to problems with generalized inequalities.
Since $x^\star$ minimizes $L(x, \lambda^\star, \nu^\star)$, its gradient with respect
to $x$ vanishes at $x^\star$:

\[
\nabla f_0(x^\star) + \sum_{i=1}^m Df_i(x^\star)^T \lambda_i^\star
+ \sum_{i=1}^p \nu_i^\star \nabla h_i(x^\star) = 0,
\]

where $Df_i(x^\star) \in \mathbf{R}^{k_i \times n}$ is the derivative of $f_i$
evaluated at $x^\star$ (see §A.4.1). Thus, if strong duality holds, any primal optimal
$x^\star$ and any dual optimal $(\lambda^\star, \nu^\star)$ must satisfy the
optimality conditions (or KKT conditions)

\[
\begin{array}{rcl}
f_i(x^\star) &\preceq_{K_i}& 0, \quad i = 1, \ldots, m \\
h_i(x^\star) &=& 0, \quad i = 1, \ldots, p \\
\lambda_i^\star &\succeq_{K_i^*}& 0, \quad i = 1, \ldots, m \\
\lambda_i^{\star T} f_i(x^\star) &=& 0, \quad i = 1, \ldots, m \\
\nabla f_0(x^\star) + \sum_{i=1}^m Df_i(x^\star)^T \lambda_i^\star
  + \sum_{i=1}^p \nu_i^\star \nabla h_i(x^\star) &=& 0.
\end{array} \tag{5.95}
\]

If the primal problem is convex, the converse also holds, i.e., the conditions (5.95)
are sufficient conditions for optimality of $x^\star$, $(\lambda^\star, \nu^\star)$.


5.9.3 Perturbation and sensitivity analysis

The results of §5.6 can be extended to problems involving generalized inequalities.
We consider the associated perturbed version of the problem,

\[
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \preceq_{K_i} u_i, \quad i = 1, \ldots, m \\
& h_i(x) = v_i, \quad i = 1, \ldots, p,
\end{array}
\]

where $u_i \in \mathbf{R}^{k_i}$, and $v \in \mathbf{R}^p$. We define
$p^\star(u, v)$ as the optimal value of the perturbed problem. As in the case with
scalar inequalities, $p^\star$ is a convex function when the original problem is
convex.

Now let $(\lambda^\star, \nu^\star)$ be optimal for the dual of the original
(unperturbed) problem, which we assume has zero duality gap. Then for all $u$ and $v$
we have

\[
p^\star(u, v) \geq p^\star - \sum_{i=1}^m \lambda_i^{\star T} u_i - \nu^{\star T} v,
\]

the analog of the global sensitivity inequality (5.57). The local sensitivity result
holds as well: If $p^\star(u, v)$ is differentiable at $u = 0$, $v = 0$, then the
optimal dual variables $\lambda_i^\star$ satisfy

\[
\lambda_i^\star = -\nabla_{u_i} p^\star(0, 0),
\]

the analog of (5.58).


Example  5.13  Semidefinite  program  in  inequality  form.  We  consider  a  semidefinite 
program  in  inequality  form,  as  in  example  5.11.  The  primal  problem  is 

    minimize    $c^T x$
    subject to  $F(x) = x_1 F_1 + \cdots + x_n F_n + G \preceq 0,$

with variable $x \in \mathbf{R}^n$ (and $F_1, \ldots, F_n, G \in \mathbf{S}^k$), and the dual problem is

    maximize    $\mathbf{tr}(GZ)$
    subject to  $\mathbf{tr}(F_i Z) + c_i = 0, \quad i = 1, \ldots, n$
                $Z \succeq 0,$

with variable $Z \in \mathbf{S}^k$.

Suppose that $x^\star$ and $Z^\star$ are primal and dual optimal, respectively, with zero duality
gap. The complementary slackness condition is $\mathbf{tr}(F(x^\star) Z^\star) = 0$. Since $F(x^\star) \preceq 0$
and $Z^\star \succeq 0$, we can conclude that $F(x^\star) Z^\star = 0$. Thus, the complementary slackness
condition can be expressed as

\[
\mathcal{R}(F(x^\star)) \perp \mathcal{R}(Z^\star),
\]

i.e., the ranges of the primal and dual matrices are orthogonal.
Let $p^\star(U)$ denote the optimal value of the perturbed SDP

    minimize    $c^T x$
    subject to  $F(x) = x_1 F_1 + \cdots + x_n F_n + G \preceq U.$

Then we have, for all $U$, $p^\star(U) \geq p^\star - \mathbf{tr}(Z^\star U)$. If $p^\star(U)$ is differentiable at $U = 0$,
then we have

\[
\nabla p^\star(0) = -Z^\star.
\]

This means that for $U$ small, the optimal value of the perturbed SDP is very close
to (the lower bound) $p^\star - \mathbf{tr}(Z^\star U)$.
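
These statements can be checked numerically on a tiny SDP whose solution is known in closed form. The sketch below is an illustration under assumed data, not part of the example: it takes $n = 1$, $c = 1$, $F_1 = -I$, so the problem is to minimize $x$ subject to $G - xI \preceq 0$, with optimal value $\lambda_{\max}(G)$ and dual optimal $Z^\star = qq^T$ for a unit eigenvector $q$ of the largest eigenvalue. The names `p_perturbed` and `lower_bound` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4

# Assumed data: minimize x subject to G - x*I <= 0 (matrix inequality),
# i.e. n = 1, c = 1, F_1 = -I. The optimal value is lambda_max(G).
B = rng.standard_normal((k, k))
G = (B + B.T) / 2

eigvals, eigvecs = np.linalg.eigh(G)    # eigenvalues in ascending order
p_star = eigvals[-1]                    # x* = lambda_max(G)
q = eigvecs[:, -1]                      # unit eigenvector for lambda_max(G)
Z_star = np.outer(q, q)                 # dual feasible: Z >= 0, tr(Z) = 1
F_star = G - p_star * np.eye(k)         # F(x*)


def p_perturbed(U):
    """Optimal value of: minimize x subject to G - x*I <= U."""
    return np.linalg.eigvalsh(G - U)[-1]


def lower_bound(U):
    """Global sensitivity bound p* - tr(Z* U)."""
    return p_star - np.trace(Z_star @ U)


# Zero duality gap and complementary slackness F(x*) Z* = 0.
print(abs(np.trace(G @ Z_star) - p_star))
print(np.linalg.norm(F_star @ Z_star))

# The bound holds for any symmetric U.
M = rng.standard_normal((k, k))
U = (M + M.T) / 2
print(p_perturbed(U) >= lower_bound(U) - 1e-9)
```

In this special case the bound is the familiar fact $\lambda_{\max}(G - U) \geq q^T (G - U) q$.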


5.9.4  Theorems  of  alternatives 

We  can  derive  theorems  of  alternatives  for  systems  of  generalized  inequalities  and 
equalities 


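\[
f_i(x) \preceq_{K_i} 0, \quad i = 1, \ldots, m, \qquad h_i(x) = 0, \quad i = 1, \ldots, p, \qquad (5.96)
\]

where $K_i \subseteq \mathbf{R}^{k_i}$ are proper cones. We will also consider systems with strict
inequalities,

\[
f_i(x) \prec_{K_i} 0, \quad i = 1, \ldots, m, \qquad h_i(x) = 0, \quad i = 1, \ldots, p. \qquad (5.97)
\]

We assume that $\mathcal{D} = \bigcap_{i=1}^m \mathbf{dom}\, f_i \cap \bigcap_{i=1}^p \mathbf{dom}\, h_i$ is nonempty.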

Weak  alternatives 

We associate with the systems (5.96) and (5.97) the dual function

\[
g(\lambda, \nu) = \inf_{x \in \mathcal{D}} \left( \sum_{i=1}^m \lambda_i^T f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \right),
\]

where $\lambda = (\lambda_1, \ldots, \lambda_m)$ with $\lambda_i \in \mathbf{R}^{k_i}$ and $\nu \in \mathbf{R}^p$. In analogy with (5.76), we
claim that

\[
\lambda_i \succeq_{K_i^*} 0, \quad i = 1, \ldots, m, \qquad g(\lambda, \nu) > 0 \qquad (5.98)
\]

is a weak alternative to the system (5.96). To verify this, suppose there exists an
$x$ satisfying (5.96) and $(\lambda, \nu)$ satisfying (5.98). Then we have a contradiction:

\[
0 < g(\lambda, \nu) \leq \lambda_1^T f_1(x) + \cdots + \lambda_m^T f_m(x) + \nu_1 h_1(x) + \cdots + \nu_p h_p(x) \leq 0.
\]

Therefore at least one of the two systems (5.96) and (5.98) must be infeasible, i.e.,
the two systems are weak alternatives.

In a similar way, we can prove that (5.97) and the system

\[
\lambda_i \succeq_{K_i^*} 0, \quad i = 1, \ldots, m, \qquad \lambda \neq 0, \qquad g(\lambda, \nu) \geq 0
\]

form a pair of weak alternatives.

Strong  alternatives 

We now assume that the functions $f_i$ are $K_i$-convex, and the functions $h_i$ are affine.
We first consider a system with strict inequalities

\[
f_i(x) \prec_{K_i} 0, \quad i = 1, \ldots, m, \qquad Ax = b, \qquad (5.99)
\]

and its alternative

\[
\lambda_i \succeq_{K_i^*} 0, \quad i = 1, \ldots, m, \qquad \lambda \neq 0, \qquad g(\lambda, \nu) \geq 0. \qquad (5.100)
\]

We have already seen that (5.99) and (5.100) are weak alternatives. They are also
strong alternatives provided the following constraint qualification holds: There
exists an $\tilde{x} \in \mathbf{relint}\, \mathcal{D}$ with $A\tilde{x} = b$. To prove this, we select a set of vectors
$e_i \succ_{K_i} 0$, and consider the problem

    minimize    $s$
    subject to  $f_i(x) \preceq_{K_i} s e_i, \quad i = 1, \ldots, m$        (5.101)
                $Ax = b$

with variables $x$ and $s \in \mathbf{R}$. Slater’s condition holds since $(\tilde{x}, \tilde{s})$ satisfies the strict
inequalities $f_i(\tilde{x}) \prec_{K_i} \tilde{s} e_i$ provided $\tilde{s}$ is large enough.

The dual of (5.101) is

    maximize    $g(\lambda, \nu)$
    subject to  $\lambda_i \succeq_{K_i^*} 0, \quad i = 1, \ldots, m$        (5.102)
                $\sum_{i=1}^m e_i^T \lambda_i = 1$

with variables $\lambda = (\lambda_1, \ldots, \lambda_m)$ and $\nu$.

Now suppose the system (5.99) is infeasible. Then the optimal value of (5.101)
is nonnegative. Since Slater’s condition is satisfied, we have strong duality and the
dual optimum is attained. Therefore there exist $(\tilde{\lambda}, \tilde{\nu})$ that satisfy the constraints
of (5.102) and $g(\tilde{\lambda}, \tilde{\nu}) \geq 0$, i.e., the system (5.100) has a solution.

As we noted in the case of scalar inequalities, existence of an $x \in \mathbf{relint}\, \mathcal{D}$ with
$Ax = b$ is not sufficient for the system of nonstrict inequalities

\[
f_i(x) \preceq_{K_i} 0, \quad i = 1, \ldots, m, \qquad Ax = b
\]

and its alternative

\[
\lambda_i \succeq_{K_i^*} 0, \quad i = 1, \ldots, m, \qquad g(\lambda, \nu) > 0
\]

to be strong alternatives. An additional condition is required, e.g., that the optimal
value of (5.101) is attained.


Example  5.14  Feasibility  of  a  linear  matrix  inequality.  The  following  systems  are 
strong  alternatives: 

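\[
F(x) = x_1 F_1 + \cdots + x_n F_n + G \prec 0,
\]

where $F_i, G \in \mathbf{S}^k$, and

\[
Z \succeq 0, \qquad Z \neq 0, \qquad \mathbf{tr}(GZ) \geq 0, \qquad \mathbf{tr}(F_i Z) = 0, \quad i = 1, \ldots, n,
\]

where $Z \in \mathbf{S}^k$. This follows from the general result, if we take for $K$ the positive
semidefinite cone $\mathbf{S}^k_+$, and

\[
g(Z) = \inf_x \, \mathbf{tr}(F(x) Z) =
\begin{cases}
\mathbf{tr}(GZ) & \mathbf{tr}(F_i Z) = 0, \quad i = 1, \ldots, n \\
-\infty & \mbox{otherwise.}
\end{cases}
\]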
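
A matrix $Z$ satisfying the second system acts as a certificate that the strict LMI is infeasible: for any $x$ we would have $\mathbf{tr}(F(x)Z) = \mathbf{tr}(GZ) \geq 0$, whereas $F(x) \prec 0$ with $Z \succeq 0$, $Z \neq 0$ forces $\mathbf{tr}(F(x)Z) < 0$. The following sketch checks such a certificate numerically; the data ($F_1 = \mathbf{diag}(1, -1)$, $G = 0$) and the helper name `is_certificate` are assumptions chosen for illustration.

```python
import numpy as np

# Assumed data: F(x) = x1*F1 + G with F1 = diag(1, -1) and G = 0.
# F(x) < 0 would require x1 < 0 and -x1 < 0 at once, so it is infeasible.
F = [np.diag([1.0, -1.0])]
G = np.zeros((2, 2))


def is_certificate(Z, F, G, tol=1e-9):
    """Check Z >= 0, Z != 0, tr(G Z) >= 0 and tr(F_i Z) = 0 for all i."""
    Z = (Z + Z.T) / 2                                   # symmetrize
    psd = np.linalg.eigvalsh(Z)[0] >= -tol              # smallest eigenvalue
    nonzero = np.linalg.norm(Z) > tol
    obj = np.trace(G @ Z) >= -tol
    orth = all(abs(np.trace(Fi @ Z)) <= tol for Fi in F)
    return bool(psd and nonzero and obj and orth)


print(is_certificate(np.eye(2), F, G))            # the identity works here
print(is_certificate(np.diag([1.0, 0.0]), F, G))  # fails: tr(F1 Z) != 0
```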
The nonstrict inequality case is slightly more involved, and we need an extra assumption
on the matrices $F_i$ to have strong alternatives. One such condition is

\[
\sum_{i=1}^n v_i F_i \succeq 0 \quad \Longrightarrow \quad \sum_{i=1}^n v_i F_i = 0.
\]

If this condition holds, the following systems are strong alternatives:

\[
F(x) = x_1 F_1 + \cdots + x_n F_n + G \preceq 0
\]

and

\[
Z \succeq 0, \qquad \mathbf{tr}(GZ) > 0, \qquad \mathbf{tr}(F_i Z) = 0, \quad i = 1, \ldots, n
\]

(see exercise 5.44).

47]. It is discussed in the context of cone programming in Nesterov and Nemirovski
[NN94, §4.2] and Ben-Tal and Nemirovski [BTN01, lecture 2]. Theorems of alternatives
for generalized inequalities were studied by Ben-Israel [BI69], Berman and Ben-Israel
[BBI71], and Craven and Kohila [CK77]. Bellman and Fan [BF63], Wolkowicz [Wol81],
and Lasserre [Las95] give extensions of Farkas’ lemma to linear matrix inequalities.

Exercises

Basic  definitions 

5.1  A  simple  example.  Consider  the  optimization  problem 

    minimize    $x^2 + 1$
    subject to  $(x - 2)(x - 4) \leq 0,$

with variable $x \in \mathbf{R}$.

(a)  Analysis  of  primal  problem.  Give  the  feasible  set,  the  optimal  value,  and  the  optimal 
solution. 

(b) Lagrangian and dual function. Plot the objective $x^2 + 1$ versus $x$. On the same plot,
show the feasible set, optimal point and value, and plot the Lagrangian $L(x, \lambda)$ versus
$x$ for a few positive values of $\lambda$. Verify the lower bound property ($p^\star \geq \inf_x L(x, \lambda)$
for $\lambda \geq 0$). Derive and sketch the Lagrange dual function $g$.

(c) Lagrange dual problem. State the dual problem, and verify that it is a concave
maximization problem. Find the dual optimal value and dual optimal solution $\lambda^\star$.
Does strong duality hold?

(d) Sensitivity analysis. Let $p^\star(u)$ denote the optimal value of the problem

    minimize    $x^2 + 1$
    subject to  $(x - 2)(x - 4) \leq u,$

as a function of the parameter $u$. Plot $p^\star(u)$. Verify that $dp^\star(0)/du = -\lambda^\star$.

5.2 Weak duality for unbounded and infeasible problems. The weak duality inequality, $d^\star \leq p^\star$,
clearly holds when $d^\star = -\infty$ or $p^\star = \infty$. Show that it holds in the other two cases as
well: If $p^\star = -\infty$, then we must have $d^\star = -\infty$, and also, if $d^\star = \infty$, then we must have
$p^\star = \infty$.

5.3 Problems with one inequality constraint. Express the dual problem of

    minimize    $c^T x$
    subject to  $f(x) \leq 0,$

with $c \neq 0$, in terms of the conjugate $f^*$. Explain why the problem you give is convex.
We do not assume $f$ is convex.

Examples  and  applications 

5.4 Interpretation of LP dual via relaxed problems. Consider the inequality form LP

    minimize    $c^T x$
    subject to  $Ax \preceq b,$

with $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$. In this exercise we develop a simple geometric interpretation
of the dual LP (5.22).

Let $w \in \mathbf{R}^m_+$. If $x$ is feasible for the LP, i.e., satisfies $Ax \preceq b$, then it also satisfies the
inequality

\[
w^T A x \leq w^T b.
\]

Geometrically, for any $w \succeq 0$, the halfspace $H_w = \{x \mid w^T A x \leq w^T b\}$ contains the feasible
set for the LP. Therefore if we minimize the objective $c^T x$ over the halfspace $H_w$ we get
a lower bound on $p^\star$.

(a) Derive an expression for the minimum value of $c^T x$ over the halfspace $H_w$ (which
will depend on the choice of $w \succeq 0$).

(b) Formulate the problem of finding the best such bound, by maximizing the lower
bound over $w \succeq 0$.

(c) Relate the results of (a) and (b) to the Lagrange dual of the LP, given by (5.22).

5.5 Dual of general LP. Find the dual function of the LP

    minimize    $c^T x$
    subject to  $Gx \preceq h$
                $Ax = b.$

Give the dual problem, and make the implicit equality constraints explicit.

5.6 Lower bounds in Chebyshev approximation from least-squares. Consider the Chebyshev
or $\ell_\infty$-norm approximation problem

    minimize  $\|Ax - b\|_\infty,$        (5.103)

where $A \in \mathbf{R}^{m \times n}$ and $\mathbf{rank}\, A = n$. Let $x_{\rm ch}$ denote an optimal solution (there may be
multiple optimal solutions; $x_{\rm ch}$ denotes one of them).

The Chebyshev problem has no closed-form solution, but the corresponding least-squares
problem does. Define

\[
x_{\rm ls} = \mathop{\rm argmin} \|Ax - b\|_2 = (A^T A)^{-1} A^T b.
\]

We address the following question. Suppose that for a particular $A$ and $b$ we have computed
the least-squares solution $x_{\rm ls}$ (but not $x_{\rm ch}$). How suboptimal is $x_{\rm ls}$ for the Chebyshev
problem? In other words, how much larger is $\|Ax_{\rm ls} - b\|_\infty$ than $\|Ax_{\rm ch} - b\|_\infty$?

(a) Prove the lower bound

\[
\|Ax_{\rm ls} - b\|_\infty \leq \sqrt{m}\, \|Ax_{\rm ch} - b\|_\infty,
\]

using the fact that for all $z \in \mathbf{R}^m$,

\[
\frac{1}{\sqrt{m}} \|z\|_2 \leq \|z\|_\infty \leq \|z\|_2.
\]

(b) In example 5.6 (page 254) we derived a dual for the general norm approximation
problem. Applying the results to the $\ell_\infty$-norm (and its dual norm, the $\ell_1$-norm), we
can state the following dual for the Chebyshev approximation problem:

    maximize    $b^T \nu$
    subject to  $\|\nu\|_1 \leq 1$        (5.104)
                $A^T \nu = 0.$

Any feasible $\nu$ corresponds to a lower bound $b^T \nu$ on $\|Ax_{\rm ch} - b\|_\infty$.

Denote the least-squares residual as $r_{\rm ls} = b - Ax_{\rm ls}$. Assuming $r_{\rm ls} \neq 0$, show that

\[
\hat{\nu} = -r_{\rm ls} / \|r_{\rm ls}\|_1, \qquad \tilde{\nu} = r_{\rm ls} / \|r_{\rm ls}\|_1,
\]

are both feasible in (5.104). By duality $b^T \hat{\nu}$ and $b^T \tilde{\nu}$ are lower bounds on
$\|Ax_{\rm ch} - b\|_\infty$. Which is the better bound? How do these bounds compare with the bound
derived in part (a)?
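
A numerical experiment is no substitute for the proofs requested above, but it can be a useful sanity check while working on part (b). The sketch below assumes random data (`A`, `b` are made up) and a hypothetical helper `is_dual_feasible`; it confirms that both candidate points satisfy the constraints of (5.104) up to rounding, and that $b^T \nu$ does not exceed the Chebyshev residual of the least-squares solution, as weak duality requires for any feasible $\nu$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3
A = rng.standard_normal((m, n))   # assumed data; full column rank with probability one
b = rng.standard_normal(m)

# Least-squares solution and residual r_ls = b - A x_ls.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
r_ls = b - A @ x_ls

nu_hat = -r_ls / np.linalg.norm(r_ls, 1)
nu_tilde = r_ls / np.linalg.norm(r_ls, 1)


def is_dual_feasible(nu, tol=1e-9):
    """Constraints of (5.104): ||nu||_1 <= 1 and A^T nu = 0."""
    return bool(np.linalg.norm(nu, 1) <= 1 + tol
                and np.linalg.norm(A.T @ nu) <= tol)


# Chebyshev (l_inf) residual achieved by the least-squares solution.
cheb_ls = np.linalg.norm(A @ x_ls - b, np.inf)

print(is_dual_feasible(nu_hat), is_dual_feasible(nu_tilde))
print(b @ nu_hat, b @ nu_tilde, cheb_ls)
```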
5.7 Piecewise-linear minimization. We consider the convex piecewise-linear minimization
problem

    minimize  $\max_{i=1,\ldots,m} (a_i^T x + b_i)$        (5.105)

with variable $x \in \mathbf{R}^n$.

(a) Derive a dual problem, based on the Lagrange dual of the equivalent problem

    minimize    $\max_{i=1,\ldots,m} y_i$
    subject to  $a_i^T x + b_i = y_i, \quad i = 1, \ldots, m,$

with variables $x \in \mathbf{R}^n$, $y \in \mathbf{R}^m$.

(b) Formulate the piecewise-linear minimization problem (5.105) as an LP, and form the
dual of the LP. Relate the LP dual to the dual obtained in part (a).

(c) Suppose we approximate the objective function in (5.105) by the smooth function

\[
f_0(x) = \log \left( \sum_{i=1}^m \exp(a_i^T x + b_i) \right),
\]

and solve the unconstrained geometric program

    minimize  $\log \left( \sum_{i=1}^m \exp(a_i^T x + b_i) \right).$        (5.106)

A dual of this problem is given by (5.62). Let $p^\star_{\rm pwl}$ and $p^\star_{\rm gp}$ be the optimal values
of (5.105) and (5.106), respectively. Show that

\[
0 \leq p^\star_{\rm gp} - p^\star_{\rm pwl} \leq \log m.
\]

(d) Derive similar bounds for the difference between $p^\star_{\rm pwl}$ and the optimal value of

    minimize  $(1/\gamma) \log \left( \sum_{i=1}^m \exp(\gamma (a_i^T x + b_i)) \right),$

where $\gamma > 0$ is a parameter. What happens as we increase $\gamma$?
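
The bound in part (c) can be explored numerically before attempting a proof. The following sketch, with made-up data `A`, `b` and hypothetical function names `f_pwl` and `f_gp`, evaluates the two objectives of (5.105) and (5.106) at random points and records the gap between them.

```python
import numpy as np


def logsumexp(z):
    """Numerically stable log(sum(exp(z))) for a 1-D array z."""
    zmax = np.max(z)
    return zmax + np.log(np.sum(np.exp(z - zmax)))


rng = np.random.default_rng(2)
m, n = 6, 3
A = rng.standard_normal((m, n))   # assumed data: rows are a_i^T
b = rng.standard_normal(m)


def f_pwl(x):
    """Objective of (5.105): max_i (a_i^T x + b_i)."""
    return np.max(A @ x + b)


def f_gp(x):
    """Objective of (5.106): log sum_i exp(a_i^T x + b_i)."""
    return logsumexp(A @ x + b)


gaps = [f_gp(x) - f_pwl(x) for x in rng.standard_normal((100, n))]
print(min(gaps), max(gaps), np.log(m))
```

The observed gaps lie between 0 and $\log m$ at every sampled point, which suggests how a pointwise inequality between the two objectives carries over to their optimal values.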

5.8  Relate  the  two  dual  problems  derived  in  example  5.9  on  page  257. 

5.9 Suboptimality of a simple covering ellipsoid. Recall the problem of determining the minimum
volume ellipsoid, centered at the origin, that contains the points $a_1, \ldots, a_m \in \mathbf{R}^n$
(problem (5.14), page 222):

    minimize    $f_0(X) = \log \det (X^{-1})$
    subject to  $a_i^T X a_i \leq 1, \quad i = 1, \ldots, m,$

with $\mathbf{dom}\, f_0 = \mathbf{S}^n_{++}$. We assume that the vectors $a_1, \ldots, a_m$ span $\mathbf{R}^n$ (which implies that
the problem is bounded below).

(a) Show that the matrix

\[
X_{\rm sim} = \left( \sum_{k=1}^m a_k a_k^T \right)^{-1}
\]

is feasible. Hint. Show that

\[
\begin{bmatrix} \sum_{k=1}^m a_k a_k^T & a_i \\ a_i^T & 1 \end{bmatrix} \succeq 0,
\]

and use Schur complements (§A.5.5) to prove that $a_i^T X_{\rm sim} a_i \leq 1$ for $i = 1, \ldots, m$.

(b) Now we establish a bound on how suboptimal the feasible point $X_{\rm sim}$ is, via the dual
problem,

    maximize    $\log \det \left( \sum_{i=1}^m \lambda_i a_i a_i^T \right) - \mathbf{1}^T \lambda + n$
    subject to  $\lambda \succeq 0,$

with the implicit constraint $\sum_{i=1}^m \lambda_i a_i a_i^T \succ 0$. (This dual is derived on page 222.)
To derive a bound, we restrict our attention to dual variables of the form $\lambda = t\mathbf{1}$,
where $t > 0$. Find (analytically) the optimal value of $t$, and evaluate the dual
objective at this $\lambda$. Use this to prove that the volume of the ellipsoid $\{u \mid u^T X_{\rm sim} u \leq 1\}$
is no more than a factor $(m/n)^{n/2}$ more than the volume of the minimum volume
ellipsoid.

5.10 Optimal experiment design. The following problems arise in experiment design (see §7.5).

(a) D-optimal design.

    minimize    $\log \det \left( \sum_{i=1}^p x_i v_i v_i^T \right)^{-1}$
    subject to  $x \succeq 0, \quad \mathbf{1}^T x = 1.$

(b) A-optimal design.

    minimize    $\mathbf{tr} \left( \sum_{i=1}^p x_i v_i v_i^T \right)^{-1}$
    subject to  $x \succeq 0, \quad \mathbf{1}^T x = 1.$

The domain of both problems is $\{x \mid \sum_{i=1}^p x_i v_i v_i^T \succ 0\}$. The variable is $x \in \mathbf{R}^p$; the
vectors $v_1, \ldots, v_p \in \mathbf{R}^n$ are given.

Derive dual problems by first introducing a new variable $X \in \mathbf{S}^n$ and an equality constraint
$X = \sum_{i=1}^p x_i v_i v_i^T$, and then applying Lagrange duality. Simplify the dual problems
as much as you can.

5.11 Derive a dual problem for

    minimize  $\sum_{i=1}^N \|A_i x + b_i\|_2 + (1/2) \|x - x_0\|_2^2.$

The problem data are $A_i \in \mathbf{R}^{m_i \times n}$, $b_i \in \mathbf{R}^{m_i}$, and $x_0 \in \mathbf{R}^n$. First introduce new variables
$y_i \in \mathbf{R}^{m_i}$ and equality constraints $y_i = A_i x + b_i$.

5.12 Analytic centering. Derive a dual problem for

    minimize  $-\sum_{i=1}^m \log(b_i - a_i^T x)$

with domain $\{x \mid a_i^T x < b_i, \; i = 1, \ldots, m\}$. First introduce new variables $y_i$ and equality
constraints $y_i = b_i - a_i^T x$.

(The solution of this problem is called the analytic center of the linear inequalities $a_i^T x \leq b_i$,
$i = 1, \ldots, m$. Analytic centers have geometric applications (see §8.5.3), and play an
important role in barrier methods (see chapter 11).)

5.13 Lagrangian relaxation of Boolean LP. A Boolean linear program is an optimization problem
of the form

    minimize    $c^T x$
    subject to  $Ax \preceq b$
                $x_i \in \{0, 1\}, \quad i = 1, \ldots, n,$

and is, in general, very difficult to solve. In exercise 4.15 we studied the LP relaxation of
this problem,

    minimize    $c^T x$
    subject to  $Ax \preceq b$        (5.107)
                $0 \leq x_i \leq 1, \quad i = 1, \ldots, n,$

which is far easier to solve, and gives a lower bound on the optimal value of the Boolean
LP. In this problem we derive another lower bound for the Boolean LP, and work out the
relation between the two lower bounds.

(a) Lagrangian relaxation. The Boolean LP can be reformulated as the problem

    minimize    $c^T x$
    subject to  $Ax \preceq b$
                $x_i (1 - x_i) = 0, \quad i = 1, \ldots, n,$

which has quadratic equality constraints. Find the Lagrange dual of this problem.
The optimal value of the dual problem (which is convex) gives a lower bound on
the optimal value of the Boolean LP. This method of finding a lower bound on the
optimal value is called Lagrangian relaxation.

(b) Show that the lower bound obtained via Lagrangian relaxation, and via the LP
relaxation (5.107), are the same. Hint. Derive the dual of the LP relaxation (5.107).

5.14 A penalty method for equality constraints. We consider the problem

    minimize    $f_0(x)$
    subject to  $Ax = b,$        (5.108)

where $f_0 : \mathbf{R}^n \to \mathbf{R}$ is convex and differentiable, and $A \in \mathbf{R}^{m \times n}$ with $\mathbf{rank}\, A = m$.

In a quadratic penalty method, we form an auxiliary function

\[
\phi(x) = f_0(x) + \alpha \|Ax - b\|_2^2,
\]

where $\alpha > 0$ is a parameter. This auxiliary function consists of the objective plus the
penalty term $\alpha \|Ax - b\|_2^2$. The idea is that a minimizer of the auxiliary function, $\tilde{x}$, should
be an approximate solution of the original problem. Intuition suggests that the larger the
penalty weight $\alpha$, the better the approximation $\tilde{x}$ to a solution of the original problem.
Suppose $\tilde{x}$ is a minimizer of $\phi$. Show how to find, from $\tilde{x}$, a dual feasible point for (5.108).
Find the corresponding lower bound on the optimal value of (5.108).

5.15 Consider the problem

    minimize    $f_0(x)$
    subject to  $f_i(x) \leq 0, \quad i = 1, \ldots, m,$        (5.109)

where the functions $f_i : \mathbf{R}^n \to \mathbf{R}$ are differentiable and convex. Let $h_1, \ldots, h_m : \mathbf{R} \to \mathbf{R}$
be increasing differentiable convex functions. Show that

\[
\phi(x) = f_0(x) + \sum_{i=1}^m h_i(f_i(x))
\]

is convex. Suppose $\tilde{x}$ minimizes $\phi$. Show how to find from $\tilde{x}$ a feasible point for the dual
of (5.109). Find the corresponding lower bound on the optimal value of (5.109).

5.16 An exact penalty method for inequality constraints. Consider the problem

    minimize    $f_0(x)$
    subject to  $f_i(x) \leq 0, \quad i = 1, \ldots, m,$        (5.110)

where the functions $f_i : \mathbf{R}^n \to \mathbf{R}$ are differentiable and convex. In an exact penalty
method, we solve the auxiliary problem

    minimize  $\phi(x) = f_0(x) + \alpha \max_{i=1,\ldots,m} \max\{0, f_i(x)\},$        (5.111)

where $\alpha > 0$ is a parameter. The second term in $\phi$ penalizes deviations of $x$ from feasibility.
The method is called an exact penalty method if for sufficiently large $\alpha$, solutions of the
auxiliary problem (5.111) also solve the original problem (5.110).

(a) Show that $\phi$ is convex.

(b) The auxiliary problem can be expressed as

    minimize    $f_0(x) + \alpha y$
    subject to  $f_i(x) \leq y, \quad i = 1, \ldots, m$
                $0 \leq y$

where the variables are $x$ and $y \in \mathbf{R}$. Find the Lagrange dual of this problem, and
express it in terms of the Lagrange dual function $g$ of (5.110).

(c) Use the result in (b) to prove the following property. Suppose $\lambda^\star$ is an optimal
solution of the Lagrange dual of (5.110), and that strong duality holds. If $\alpha > \mathbf{1}^T \lambda^\star$,
then any solution of the auxiliary problem (5.111) is also an optimal solution
of (5.110).

5.17 Robust linear programming with polyhedral uncertainty. Consider the robust LP

    minimize    $c^T x$
    subject to  $\sup_{a \in \mathcal{P}_i} a^T x \leq b_i, \quad i = 1, \ldots, m,$

with variable $x \in \mathbf{R}^n$, where $\mathcal{P}_i = \{a \mid C_i a \preceq d_i\}$. The problem data are $c \in \mathbf{R}^n$,
$C_i \in \mathbf{R}^{m_i \times n}$, $d_i \in \mathbf{R}^{m_i}$, and $b \in \mathbf{R}^m$. We assume the polyhedra $\mathcal{P}_i$ are nonempty.

Show that this problem is equivalent to the LP

    minimize    $c^T x$
    subject to  $d_i^T z_i \leq b_i, \quad i = 1, \ldots, m$
                $C_i^T z_i = x, \quad i = 1, \ldots, m$
                $z_i \succeq 0, \quad i = 1, \ldots, m$

with variables $x \in \mathbf{R}^n$ and $z_i \in \mathbf{R}^{m_i}$, $i = 1, \ldots, m$. Hint. Find the dual of the problem
of maximizing $a_i^T x$ over $a_i \in \mathcal{P}_i$ (with variable $a_i$).

5.18 Separating hyperplane between two polyhedra. Formulate the following problem as an LP
or an LP feasibility problem. Find a separating hyperplane that strictly separates two
polyhedra

\[
\mathcal{P}_1 = \{x \mid Ax \preceq b\}, \qquad \mathcal{P}_2 = \{x \mid Cx \preceq d\},
\]

i.e., find a vector $a \in \mathbf{R}^n$ and a scalar $\gamma$ such that

\[
a^T x > \gamma \mbox{ for } x \in \mathcal{P}_1, \qquad a^T x < \gamma \mbox{ for } x \in \mathcal{P}_2.
\]

You can assume that $\mathcal{P}_1$ and $\mathcal{P}_2$ do not intersect.

Hint. The vector $a$ and scalar $\gamma$ must satisfy

\[
\inf_{x \in \mathcal{P}_1} a^T x > \gamma > \sup_{x \in \mathcal{P}_2} a^T x.
\]

Use LP duality to simplify the infimum and supremum in these conditions.

5.19 The sum of the largest elements of a vector. Define $f : \mathbf{R}^n \to \mathbf{R}$ as

\[
f(x) = \sum_{i=1}^r x_{[i]},
\]

where $r$ is an integer between 1 and $n$, and $x_{[1]} \geq x_{[2]} \geq \cdots \geq x_{[r]}$ are the components of
$x$ sorted in decreasing order. In other words, $f(x)$ is the sum of the $r$ largest elements of
$x$. In this problem we study the constraint

\[
f(x) \leq \alpha.
\]

As we have seen in chapter 3, page 80, this is a convex constraint, and equivalent to a set
of $n!/(r!(n - r)!)$ linear inequalities

\[
x_{i_1} + \cdots + x_{i_r} \leq \alpha, \qquad 1 \leq i_1 < i_2 < \cdots < i_r \leq n.
\]

The purpose of this problem is to derive a more compact representation.

(a) Given a vector $x \in \mathbf{R}^n$, show that $f(x)$ is equal to the optimal value of the LP

    maximize    $x^T y$
    subject to  $0 \preceq y \preceq \mathbf{1}$
                $\mathbf{1}^T y = r$

with $y \in \mathbf{R}^n$ as variable.

(b) Derive the dual of the LP in part (a). Show that it can be written as

    minimize    $rt + \mathbf{1}^T u$
    subject to  $t\mathbf{1} + u \succeq x$
                $u \succeq 0,$

where the variables are $t \in \mathbf{R}$, $u \in \mathbf{R}^n$. By duality this LP has the same optimal
value as the LP in (a), i.e., $f(x)$. We therefore have the following result: $x$ satisfies
$f(x) \leq \alpha$ if and only if there exist $t \in \mathbf{R}$, $u \in \mathbf{R}^n$ such that

\[
rt + \mathbf{1}^T u \leq \alpha, \qquad t\mathbf{1} + u \succeq x, \qquad u \succeq 0.
\]

These conditions form a set of $2n + 1$ linear inequalities in the $2n + 1$ variables $x$, $u$, $t$.
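
The compact representation can be tested numerically. The sketch below is one illustrative construction, not the derivation asked for in part (b): it assumes the choice $t = x_{[r]}$ and $u_i = \max\{x_i - t, 0\}$, and the helper names `sum_largest` and `compact_point` are hypothetical. It checks that this pair satisfies $t\mathbf{1} + u \succeq x$, $u \succeq 0$ and attains $rt + \mathbf{1}^T u = f(x)$.

```python
import numpy as np


def sum_largest(x, r):
    """f(x): sum of the r largest entries of x."""
    return float(np.sum(np.sort(x)[::-1][:r]))


def compact_point(x, r):
    """Assumed construction: t = r-th largest entry, u = max(x - t, 0)."""
    t = float(np.sort(x)[::-1][r - 1])
    u = np.maximum(x - t, 0.0)
    return t, u


x = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
r = 2
t, u = compact_point(x, r)

print(sum_largest(x, r))                 # 4 + 5
print(r * t + u.sum())                   # same value from (t, u)
print(np.all(t + u >= x), np.all(u >= 0))
```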

(c) As an application, we consider an extension of the classical Markowitz portfolio
optimization problem

    minimize    $x^T \Sigma x$
    subject to  $\bar{p}^T x \geq r_{\min}$
                $\mathbf{1}^T x = 1, \quad x \succeq 0$

discussed in chapter 4, page 155. The variable is the portfolio $x \in \mathbf{R}^n$; $\bar{p}$ and $\Sigma$ are
the mean and covariance matrix of the price change vector $p$.

Suppose we add a diversification constraint, requiring that no more than 80% of
the total budget can be invested in any 10% of the assets. This constraint can be
expressed as

\[
\sum_{i=1}^{\lfloor 0.1 n \rfloor} x_{[i]} \leq 0.8.
\]

Formulate the portfolio optimization problem with diversification constraint as a
QP.

5.20 Dual of channel capacity problem. Derive a dual for the problem

    minimize    $-c^T x + \sum_{i=1}^m y_i \log y_i$
    subject to  $Px = y$
                $x \succeq 0, \quad \mathbf{1}^T x = 1,$

where $P \in \mathbf{R}^{m \times n}$ has nonnegative elements, and its columns add up to one (i.e., $P^T \mathbf{1} = \mathbf{1}$).
The variables are $x \in \mathbf{R}^n$, $y \in \mathbf{R}^m$. (For $c_j = \sum_{i=1}^m p_{ij} \log p_{ij}$, the optimal value is,
up to a factor $\log 2$, the negative of the capacity of a discrete memoryless channel with
channel transition probability matrix $P$; see exercise 4.57.)

Simplify the dual problem as much as possible.

Strong duality and Slater’s condition

5.21 A convex problem in which strong duality fails. Consider the optimization problem

    minimize    $e^{-x}$
    subject to  $x^2 / y \leq 0$

with variables $x$ and $y$, and domain $\mathcal{D} = \{(x, y) \mid y > 0\}$.

(a) Verify that this is a convex optimization problem. Find the optimal value.

(b) Give the Lagrange dual problem, and find the optimal solution $\lambda^\star$ and optimal value
$d^\star$ of the dual problem. What is the optimal duality gap?

(c) Does Slater’s condition hold for this problem?

(d) What is the optimal value $p^\star(u)$ of the perturbed problem

    minimize    $e^{-x}$
    subject to  $x^2 / y \leq u$

as a function of $u$? Verify that the global sensitivity inequality

\[
p^\star(u) \geq p^\star(0) - \lambda^\star u
\]

does not hold.

5.22 Geometric interpretation of duality. For each of the following optimization problems,
draw a sketch of the sets

\[
\mathcal{G} = \{(u, t) \mid \exists x \in \mathcal{D}, \; f_0(x) = t, \; f_1(x) = u\},
\]
\[
\mathcal{A} = \{(u, t) \mid \exists x \in \mathcal{D}, \; f_0(x) \leq t, \; f_1(x) \leq u\},
\]

give the dual problem, and solve the primal and dual problems. Is the problem convex?
Is Slater’s condition satisfied? Does strong duality hold?

The domain of the problem is $\mathbf{R}$ unless otherwise stated.

(a) Minimize $x$ subject to $x^2 \leq 1$.

(b) Minimize $x$ subject to $x^2 \leq 0$.

(c) Minimize $x$ subject to $|x| \leq 0$.

(d) Minimize $x$ subject to $f_1(x) \leq 0$ where

\[
f_1(x) = \begin{cases}
-x + 2 & x \geq 1 \\
x & -1 \leq x \leq 1 \\
-x - 2 & x \leq -1.
\end{cases}
\]

(e) Minimize $x^3$ subject to $-x + 1 \leq 0$.

(f) Minimize $x^3$ subject to $-x + 1 \leq 0$ with domain $\mathcal{D} = \mathbf{R}_+$.

5.23 Strong duality in linear programming. We prove that strong duality holds for the LP

    minimize    $c^T x$
    subject to  $Ax \preceq b$

and its dual

    maximize    $-b^T z$
    subject to  $A^T z + c = 0, \quad z \succeq 0,$

provided at least one of the problems is feasible. In other words, the only possible exception
to strong duality occurs when $p^\star = \infty$ and $d^\star = -\infty$.

(a) Suppose $p^\star$ is finite and $x^\star$ is an optimal solution. (If finite, the optimal value of an
LP is attained.) Let $I \subseteq \{1, 2, \ldots, m\}$ be the set of active constraints at $x^\star$:

\[
a_i^T x^\star = b_i, \quad i \in I, \qquad a_i^T x^\star < b_i, \quad i \notin I.
\]

Show that there exists a $z \in \mathbf{R}^m$ that satisfies

\[
z_i \geq 0, \quad i \in I, \qquad z_i = 0, \quad i \notin I, \qquad \sum_{i \in I} z_i a_i + c = 0.
\]

Show that $z$ is dual optimal with objective value $c^T x^\star$.

Hint. Assume there exists no such $z$, i.e., $-c \notin \{\sum_{i \in I} z_i a_i \mid z_i \geq 0\}$. Reduce
this to a contradiction by applying the strict separating hyperplane theorem of
example 2.20, page 49. Alternatively, you can use Farkas’ lemma (see §5.8.3).

(b) Suppose $p^\star = \infty$ and the dual problem is feasible. Show that $d^\star = \infty$. Hint. Show
that there exists a nonzero $v \in \mathbf{R}^m$ such that $A^T v = 0$, $v \succeq 0$, $b^T v < 0$. If the dual
is feasible, it is unbounded in the direction $v$.

(c) Consider the example

    minimize    $x$
    subject to  $\begin{bmatrix} 0 \\ 1 \end{bmatrix} x \preceq \begin{bmatrix} -1 \\ 1 \end{bmatrix}.$

Formulate the dual LP, and solve the primal and dual problems. Show that $p^\star = \infty$
and $d^\star = -\infty$.


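5.24 Weak max-min inequality. Show that the weak max-min inequality

\[
\sup_{z \in Z} \inf_{w \in W} f(w, z) \leq \inf_{w \in W} \sup_{z \in Z} f(w, z)
\]

always holds, with no assumptions on $f : \mathbf{R}^n \times \mathbf{R}^m \to \mathbf{R}$, $W \subseteq \mathbf{R}^n$, or $Z \subseteq \mathbf{R}^m$.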
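
For finite sets $W$ and $Z$ the inequality reduces to a statement about a table of values, which makes it easy to experiment with. The sketch below assumes a matrix `F` with `F[i, j]` standing for $f(w_i, z_j)$; the function names are hypothetical.

```python
import numpy as np


def max_min(F):
    """sup over z (columns) of inf over w (rows) of F[w, z]."""
    return F.min(axis=0).max()


def min_max(F):
    """inf over w (rows) of sup over z (columns) of F[w, z]."""
    return F.max(axis=1).min()


# A table where the inequality is strict: max-min is 0, min-max is 1.
F_gap = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
print(max_min(F_gap), min_max(F_gap))

# Random tables never violate the inequality.
rng = np.random.default_rng(3)
F_rand = rng.standard_normal((5, 7))
print(max_min(F_rand) <= min_max(F_rand))
```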

5.25 [BL00, page 95] Convex-concave functions and the saddle-point property. We derive conditions
under which the saddle-point property

\[
\sup_{z \in Z} \inf_{w \in W} f(w, z) = \inf_{w \in W} \sup_{z \in Z} f(w, z) \qquad (5.112)
\]

holds, where $f : \mathbf{R}^n \times \mathbf{R}^m \to \mathbf{R}$, $W \times Z \subseteq \mathbf{dom}\, f$, and $W$ and $Z$ are nonempty. We will
assume that the function

\[
g_z(w) = \begin{cases} f(w, z) & w \in W \\ \infty & \mbox{otherwise} \end{cases}
\]

is closed and convex for all $z \in Z$, and the function

\[
h_w(z) = \begin{cases} -f(w, z) & z \in Z \\ \infty & \mbox{otherwise} \end{cases}
\]

is closed and convex for all $w \in W$.

(a) The righthand side of (5.112) can be expressed as $p(0)$, where

\[
p(u) = \inf_{w \in W} \sup_{z \in Z} \left( f(w, z) + u^T z \right).
\]

Show that $p$ is a convex function.

(b) Show that the conjugate of $p$ is given by

\[
p^*(v) = \begin{cases} -\inf_{w \in W} f(w, v) & v \in Z \\ \infty & \mbox{otherwise.} \end{cases}
\]

(c) Show that the conjugate of $p^*$ is given by

\[
p^{**}(u) = \sup_{z \in Z} \inf_{w \in W} \left( f(w, z) + u^T z \right).
\]

Combining this with (a), we can express the max-min equality (5.112) as $p^{**}(0) = p(0)$.

(d) From exercises 3.28 and 3.39 (d), we know that $p^{**}(0) = p(0)$ if $0 \in \mathbf{int}\, \mathbf{dom}\, p$.
Conclude that this is the case if $W$ and $Z$ are bounded.

(e) As another consequence of exercises 3.28 and 3.39, we have $p^{**}(0) = p(0)$ if $0 \in \mathbf{dom}\, p$
and $p$ is closed. Show that $p$ is closed if the sublevel sets of $g_z$ are bounded.


Optimality  conditions 

5.26 Consider the QCQP

    minimize    $x_1^2 + x_2^2$
    subject to  $(x_1 - 1)^2 + (x_2 - 1)^2 \leq 1$
                $(x_1 - 1)^2 + (x_2 + 1)^2 \leq 1$

with variable $x \in \mathbf{R}^2$.

(a) Sketch the feasible set and level sets of the objective. Find the optimal point $x^\star$ and
optimal value $p^\star$.

(b) Give the KKT conditions. Do there exist Lagrange multipliers $\lambda_1^\star$ and $\lambda_2^\star$ that prove
that $x^\star$ is optimal?

(c) Derive and solve the Lagrange dual problem. Does strong duality hold?

5.27 Equality constrained least-squares. Consider the equality constrained least-squares problem

    minimize    $\|Ax - b\|_2^2$
    subject to  $Gx = h$

where $A \in \mathbf{R}^{m \times n}$ with $\mathbf{rank}\, A = n$, and $G \in \mathbf{R}^{p \times n}$ with $\mathbf{rank}\, G = p$.

Give the KKT conditions, and derive expressions for the primal solution $x^\star$ and the dual
solution $\nu^\star$.

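5.28 Prove (without using any linear programming code) that the optimal solution of the LP

    minimize    $47x_1 + 93x_2 + 17x_3 - 93x_4$

    subject to  $\begin{bmatrix}
                -1 & -6 & 1 & 3 \\
                -1 & -2 & 7 & 1 \\
                0 & 3 & -10 & -1 \\
                -6 & -11 & -2 & 12 \\
                1 & 6 & -1 & -3
                \end{bmatrix}
                \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
                \preceq
                \begin{bmatrix} -3 \\ 5 \\ -8 \\ -7 \\ 4 \end{bmatrix}$

is unique, and given by $x^\star = (1, 1, 1, 1)$.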

5.29 The problem

    minimize    $-3x_1^2 + x_2^2 + 2x_3^2 + 2(x_1 + x_2 + x_3)$
    subject to  $x_1^2 + x_2^2 + x_3^2 = 1,$

is a special case of (5.32), so strong duality holds even though the problem is not convex.
Derive the KKT conditions. Find all solutions $x$, $\nu$ that satisfy the KKT conditions.
Which pair corresponds to the optimum?

5.30 Derive the KKT conditions for the problem

    minimize    $\mathbf{tr}\, X - \log \det X$
    subject to  $Xs = y,$

with variable $X \in \mathbf{S}^n$ and domain $\mathbf{S}^n_{++}$. $y \in \mathbf{R}^n$ and $s \in \mathbf{R}^n$ are given, with $s^T y = 1$.
Verify that the optimal solution is given by

\[
X^\star = I + y y^T - \frac{1}{s^T s} s s^T.
\]

5.31 Supporting hyperplane interpretation of KKT conditions. Consider a convex problem with
no equality constraints,

    minimize    $f_0(x)$
    subject to  $f_i(x) \leq 0, \quad i = 1, \ldots, m.$

Assume that $x^\star \in \mathbf{R}^n$ and $\lambda^\star \in \mathbf{R}^m$ satisfy the KKT conditions

\[
\begin{array}{ll}
f_i(x^\star) \leq 0, & i = 1, \ldots, m \\
\lambda_i^\star \geq 0, & i = 1, \ldots, m \\
\lambda_i^\star f_i(x^\star) = 0, & i = 1, \ldots, m \\
\nabla f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star \nabla f_i(x^\star) = 0. &
\end{array}
\]

Show that

\[
\nabla f_0(x^\star)^T (x - x^\star) \geq 0
\]

for all feasible $x$. In other words the KKT conditions imply the simple optimality criterion
of §4.2.3.


Perturbation  and  sensitivity  analysis 

5.32 Optimal value of perturbed problem. Let $f_0, f_1, \ldots, f_m : \mathbf{R}^n \to \mathbf{R}$ be convex. Show that
the function

\[
p^\star(u, v) = \inf \{ f_0(x) \mid \exists x \in \mathcal{D}, \; f_i(x) \leq u_i, \; i = 1, \ldots, m, \; Ax - b = v \}
\]

is convex. This function is the optimal cost of the perturbed problem, as a function of
the perturbations $u$ and $v$ (see §5.6.1).

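5.33 Parametrized $\ell_1$-norm approximation. Consider the $\ell_1$-norm minimization problem

    \[ \mbox{minimize} \quad \|Ax + b + \epsilon d\|_1 \]

with variable $x \in \mathbf{R}^3$, and

    \[
    A = \left[ \begin{array}{rrr}
    -2 &  7 &  1 \\
    -5 & -1 &  3 \\
    -7 &  3 & -5 \\
    -1 &  4 & -4 \\
     1 &  5 &  5 \\
     2 & -5 & -1
    \end{array} \right], \qquad
    b = \left[ \begin{array}{r} -4 \\ 3 \\ 9 \\ 0 \\ -11 \\ 5 \end{array} \right], \qquad
    d = \left[ \begin{array}{r} -10 \\ -13 \\ -27 \\ -10 \\ -7 \\ 14 \end{array} \right].
    \]

We denote by $p^\star(\epsilon)$ the optimal value as a function of $\epsilon$.

(a) Suppose $\epsilon = 0$. Prove that $x^\star = \mathbf{1}$ is optimal. Are there any other optimal points?

(b) Show that $p^\star(\epsilon)$ is affine on an interval that includes $\epsilon = 0$.

5.34 Consider the pair of primal and dual LPs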


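    \[ \begin{array}{ll}
    \mbox{minimize}   & (c + \epsilon d)^T x \\
    \mbox{subject to} & Ax \preceq b + \epsilon f
    \end{array} \]

and

    \[ \begin{array}{ll}
    \mbox{maximize}   & -(b + \epsilon f)^T z \\
    \mbox{subject to} & A^T z + c + \epsilon d = 0 \\
                      & z \succeq 0
    \end{array} \]

where

    \[
    A = \left[ \begin{array}{rrrr}
     -4 & 12 & -2 &  1 \\
    -17 & 12 &  7 & 11 \\
      1 &  0 & -6 &  1 \\
      3 &  3 & 22 & -1 \\
    -11 &  2 & -1 & -8
    \end{array} \right], \qquad
    b = \left[ \begin{array}{r} 8 \\ 13 \\ -4 \\ 27 \\ -18 \end{array} \right], \qquad
    f = \left[ \begin{array}{r} 6 \\ 15 \\ -13 \\ 48 \\ 8 \end{array} \right],
    \]

$c = (49, -34, -50, -5)$, $d = (3, 8, 21, 25)$, and $\epsilon$ is a parameter.

(a) Prove that $x^\star = (1, 1, 1, 1)$ is optimal when $\epsilon = 0$, by constructing a dual optimal
point $z^\star$ that has the same objective value as $x^\star$. Are there any other primal or dual
optimal solutions?

(b) Give an explicit expression for the optimal value $p^\star(\epsilon)$ as a function of $\epsilon$ on an
interval that contains $\epsilon = 0$. Specify the interval on which your expression is valid.
Also give explicit expressions for the primal solution $x^\star(\epsilon)$ and the dual solution
$z^\star(\epsilon)$ as a function of $\epsilon$, on the same interval.

Hint. First calculate $x^\star(\epsilon)$ and $z^\star(\epsilon)$, assuming that the primal and dual constraints
that are active at the optimum for $\epsilon = 0$ remain active at the optimum for values
of $\epsilon$ around 0. Then verify that this assumption is correct.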


5.35 Sensitivity analysis for GPs. Consider a GP

    \[ \begin{array}{ll}
    \mbox{minimize}   & f_0(x) \\
    \mbox{subject to} & f_i(x) \le 1, \quad i = 1, \ldots, m \\
                      & h_i(x) = 1, \quad i = 1, \ldots, p,
    \end{array} \]

where $f_0, \ldots, f_m$ are posynomials, $h_1, \ldots, h_p$ are monomials, and the domain of the
problem is $\mathbf{R}^n_{++}$. We define the perturbed GP as

    \[ \begin{array}{ll}
    \mbox{minimize}   & f_0(x) \\
    \mbox{subject to} & f_i(x) \le e^{u_i}, \quad i = 1, \ldots, m \\
                      & h_i(x) = e^{v_i}, \quad i = 1, \ldots, p,
    \end{array} \]

and we denote the optimal value of the perturbed GP as $p^\star(u, v)$. We can think of $u_i$ and
$v_i$ as relative, or fractional, perturbations of the constraints. For example, $u_1 = -0.01$
corresponds to tightening the first inequality constraint by (approximately) 1%.

Let $\lambda^\star$ and $\nu^\star$ be optimal dual variables for the convex form GP

    \[ \begin{array}{ll}
    \mbox{minimize}   & \log f_0(y) \\
    \mbox{subject to} & \log f_i(y) \le 0, \quad i = 1, \ldots, m \\
                      & \log h_i(y) = 0, \quad i = 1, \ldots, p,
    \end{array} \]

with variables $y_i = \log x_i$. Assuming that $p^\star(u, v)$ is differentiable at $u = 0$, $v = 0$, relate
$\lambda^\star$ and $\nu^\star$ to the derivatives of $p^\star(u, v)$ at $u = 0$, $v = 0$. Justify the statement “Relaxing
the $i$th constraint by $\alpha$ percent will give an improvement in the objective of around $\alpha \lambda_i^\star$
percent, for $\alpha$ small.”

Theorems of alternatives


5.36 Alternatives for linear equalities. Consider the linear equations $Ax = b$, where $A \in \mathbf{R}^{m \times n}$.
From linear algebra we know that this equation has a solution if and only if $b \in \mathcal{R}(A)$, which
occurs if and only if $b \perp \mathcal{N}(A^T)$. In other words, $Ax = b$ has a solution if and only if
there exists no $y \in \mathbf{R}^m$ such that $A^T y = 0$ and $b^T y \ne 0$.

Derive  this  result  from  the  theorems  of  alternatives  in  §5.8.2. 

5.37 [BT97] Existence of equilibrium distribution in finite state Markov chain. Let $P \in \mathbf{R}^{n \times n}$
be a matrix that satisfies

    \[ p_{ij} \ge 0, \quad i, j = 1, \ldots, n, \qquad P^T \mathbf{1} = \mathbf{1}, \]

i.e., the coefficients are nonnegative and the columns sum to one. Use Farkas’ lemma to
prove there exists a $y \in \mathbf{R}^n$ such that

    \[ Py = y, \qquad y \succeq 0, \qquad \mathbf{1}^T y = 1. \]

(We  can  interpret  y  as  an  equilibrium  distribution  of  the  Markov  chain  with  n  states  and 
transition  probability  matrix  P.) 

5.38  [BT97]  Option  pricing.  We  apply  the  results  of  example  5.10,  page  263,  to  a  simple 
problem  with  three  assets:  a  riskless  asset  with  fixed  return  r  >  1  over  the  investment 
period  of  interest  (for  example,  a  bond),  a  stock,  and  an  option  on  the  stock.  The  option 
gives  us  the  right  to  purchase  the  stock  at  the  end  of  the  period,  for  a  predetermined 
price  K. 

We  consider  two  scenarios.  In  the  first  scenario,  the  price  of  the  stock  goes  up  from 
S  at  the  beginning  of  the  period,  to  Su  at  the  end  of  the  period,  where  u  >  r.  In  this 
scenario, we exercise the option only if $Su > K$, in which case we make a profit of $Su - K$.
Otherwise, we do not exercise the option, and make zero profit. The value of the option
at the end of the period, in the first scenario, is therefore $\max\{0, Su - K\}$.

In the second scenario, the price of the stock goes down from $S$ to $Sd$, where $d < 1$. The
value at the end of the period is $\max\{0, Sd - K\}$.

In the notation of example 5.10,

    \[
    V = \left[ \begin{array}{ccc}
    r & uS & \max\{0, Su - K\} \\
    r & dS & \max\{0, Sd - K\}
    \end{array} \right], \qquad
    p_1 = 1, \quad p_2 = S, \quad p_3 = C,
    \]


where  C  is  the  price  of  the  option. 

Show  that  for  given  r,  S,  K,  u,  d,  the  option  price  C  is  uniquely  determined  by  the 
no-arbitrage  condition.  In  other  words,  the  market  for  the  option  is  complete. 


Generalized  inequalities 


5.39 SDP relaxations of two-way partitioning problem. We consider the two-way partitioning
problem (5.7), described on page 219,

    \[ \begin{array}{ll}
    \mbox{minimize}   & x^T W x \\
    \mbox{subject to} & x_i^2 = 1, \quad i = 1, \ldots, n,
    \end{array} \]    (5.113)

with variable $x \in \mathbf{R}^n$. The Lagrange dual of this (nonconvex) problem is given by the
SDP

    \[ \begin{array}{ll}
    \mbox{maximize}   & -\mathbf{1}^T \nu \\
    \mbox{subject to} & W + \operatorname{diag}(\nu) \succeq 0
    \end{array} \]    (5.114)

with variable $\nu \in \mathbf{R}^n$. The optimal value of this SDP gives a lower bound on the optimal
value of the partitioning problem (5.113). In this exercise we derive another SDP that
gives a lower bound on the optimal value of the two-way partitioning problem, and explore
the connection between the two SDPs.


(a) Two-way partitioning problem in matrix form. Show that the two-way partitioning
problem can be cast as

    \[ \begin{array}{ll}
    \mbox{minimize}   & \operatorname{tr}(WX) \\
    \mbox{subject to} & X \succeq 0, \quad \operatorname{rank} X = 1 \\
                      & X_{ii} = 1, \quad i = 1, \ldots, n,
    \end{array} \]

with variable $X \in \mathbf{S}^n$. Hint. Show that if $X$ is feasible, then it has the form
$X = xx^T$, where $x \in \mathbf{R}^n$ satisfies $x_i \in \{-1, 1\}$ (and vice versa).

(b) SDP relaxation of two-way partitioning problem. Using the formulation in part (a),
we can form the relaxation

    \[ \begin{array}{ll}
    \mbox{minimize}   & \operatorname{tr}(WX) \\
    \mbox{subject to} & X \succeq 0 \\
                      & X_{ii} = 1, \quad i = 1, \ldots, n,
    \end{array} \]    (5.115)

with variable $X \in \mathbf{S}^n$. This problem is an SDP, and therefore can be solved efficiently.
Explain why its optimal value gives a lower bound on the optimal value of
the two-way partitioning problem (5.113). What can you say if an optimal point
$X^\star$ for this SDP has rank one?

(c)  We  now  have  two  SDPs  that  give  a  lower  bound  on  the  optimal  value  of  the  two-way 
partitioning  problem  (5.113):  the  SDP  relaxation  (5.115)  found  in  part  (b),  and  the 
Lagrange  dual  of  the  two-way  partitioning  problem,  given  in  (5.114).  What  is  the 
relation  between  the  two  SDPs?  What  can  you  say  about  the  lower  bounds  found 
by  them?  Hint:  Relate  the  two  SDPs  via  duality. 

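5.40 E-optimal experiment design. A variation on the two optimal experiment design problems
of exercise 5.10 is the E-optimal design problem

    \[ \begin{array}{ll}
    \mbox{minimize}   & \lambda_{\max}\left( \left( \sum_{i=1}^p x_i v_i v_i^T \right)^{-1} \right) \\
    \mbox{subject to} & x \succeq 0, \quad \mathbf{1}^T x = 1.
    \end{array} \]

(See also §7.5.) Derive a dual for this problem, by first reformulating it as

    \[ \begin{array}{ll}
    \mbox{minimize}   & 1/t \\
    \mbox{subject to} & \sum_{i=1}^p x_i v_i v_i^T \succeq tI \\
                      & x \succeq 0, \quad \mathbf{1}^T x = 1,
    \end{array} \]

with variables $t \in \mathbf{R}$, $x \in \mathbf{R}^p$ and domain $\mathbf{R}_{++} \times \mathbf{R}^p$, and applying Lagrange duality.
Simplify the dual problem as much as you can.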

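5.41 Dual of fastest mixing Markov chain problem. On page 174, we encountered the SDP

    \[ \begin{array}{ll}
    \mbox{minimize}   & t \\
    \mbox{subject to} & -tI \preceq P - (1/n)\mathbf{1}\mathbf{1}^T \preceq tI \\
                      & P\mathbf{1} = \mathbf{1} \\
                      & P_{ij} \ge 0, \quad i, j = 1, \ldots, n \\
                      & P_{ij} = 0 \mbox{ for } (i, j) \notin \mathcal{E},
    \end{array} \]

with variables $t \in \mathbf{R}$, $P \in \mathbf{S}^n$.

Show that the dual of this problem can be expressed as

    \[ \begin{array}{ll}
    \mbox{maximize}   & \mathbf{1}^T z - (1/n)\mathbf{1}^T Y \mathbf{1} \\
    \mbox{subject to} & \|Y\|_{2*} \le 1 \\
                      & (z_i + z_j) \le Y_{ij} \mbox{ for } (i, j) \in \mathcal{E}
    \end{array} \]

with variables $z \in \mathbf{R}^n$ and $Y \in \mathbf{S}^n$. The norm $\|\cdot\|_{2*}$ is the dual of the spectral norm
on $\mathbf{S}^n$: $\|Y\|_{2*} = \sum_{i=1}^n |\lambda_i(Y)|$, the sum of the absolute values of the eigenvalues of $Y$.
(See §A.1.6, page 637.)

5.42 Lagrange dual of conic form problem in inequality form. Find the Lagrange dual problem
of the conic form problem in inequality form

    \[ \begin{array}{ll}
    \mbox{minimize}   & c^T x \\
    \mbox{subject to} & Ax \preceq_K b
    \end{array} \]

where $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$, and $K$ is a proper cone in $\mathbf{R}^m$. Make any implicit equality
constraints explicit.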

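5.43 Dual of SOCP. Show that the dual of the SOCP

    \[ \begin{array}{ll}
    \mbox{minimize}   & f^T x \\
    \mbox{subject to} & \|A_i x + b_i\|_2 \le c_i^T x + d_i, \quad i = 1, \ldots, m,
    \end{array} \]

with variables $x \in \mathbf{R}^n$, can be expressed as

    \[ \begin{array}{ll}
    \mbox{maximize}   & \sum_{i=1}^m (b_i^T u_i - d_i v_i) \\
    \mbox{subject to} & \sum_{i=1}^m (A_i^T u_i - c_i v_i) + f = 0 \\
                      & \|u_i\|_2 \le v_i, \quad i = 1, \ldots, m,
    \end{array} \]

with variables $u_i \in \mathbf{R}^{n_i}$, $v_i \in \mathbf{R}$, $i = 1, \ldots, m$. The problem data are $f \in \mathbf{R}^n$, $A_i \in \mathbf{R}^{n_i \times n}$,
$b_i \in \mathbf{R}^{n_i}$, $c_i \in \mathbf{R}^n$ and $d_i \in \mathbf{R}$, $i = 1, \ldots, m$.

Derive the dual in the following two ways.

(a) Introduce new variables $y_i \in \mathbf{R}^{n_i}$ and $t_i \in \mathbf{R}$ and equalities $y_i = A_i x + b_i$, $t_i =
c_i^T x + d_i$, and derive the Lagrange dual.

(b) Start from the conic formulation of the SOCP and use the conic dual. Use the fact
that the second-order cone is self-dual.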

5.44 Strong alternatives for nonstrict LMIs. In example 5.14, page 270, we mentioned that
the system

    \[ Z \succeq 0, \qquad \operatorname{tr}(GZ) > 0, \qquad \operatorname{tr}(F_i Z) = 0, \quad i = 1, \ldots, n, \]    (5.116)

is a strong alternative for the nonstrict LMI

    \[ F(x) = x_1 F_1 + \cdots + x_n F_n + G \preceq 0, \]    (5.117)

if the matrices $F_i$ satisfy

    \[ \sum_{i=1}^n v_i F_i \succeq 0 \;\Longrightarrow\; \sum_{i=1}^n v_i F_i = 0. \]    (5.118)

In this exercise we prove this result, and give an example to illustrate that the systems
are not always strong alternatives.

(a) Suppose (5.118) holds, and that the optimal value of the auxiliary SDP

    \[ \begin{array}{ll}
    \mbox{minimize}   & s \\
    \mbox{subject to} & F(x) \preceq sI
    \end{array} \]

is positive. Show that the optimal value is attained. It follows from the discussion
in §5.9.4 that the systems (5.117) and (5.116) are strong alternatives.

Hint. The proof simplifies if you assume, without loss of generality, that the matrices
$F_1, \ldots, F_n$ are independent, so (5.118) may be replaced by $\sum_{i=1}^n v_i F_i \succeq 0 \Rightarrow v = 0$.

(b) Take $n = 1$, and

    \[
    G = \left[ \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array} \right], \qquad
    F_1 = \left[ \begin{array}{cc} 0 & 0 \\ 0 & 1 \end{array} \right].
    \]

Show that (5.117) and (5.116) are both infeasible.


Part  II 

Applications 


Chapter  6 


Approximation  and  fitting 


6.1  Norm  approximation 

6.1.1  Basic  norm  approximation  problem 

The simplest norm approximation problem is an unconstrained problem of the form

    \[ \mbox{minimize} \quad \|Ax - b\| \]    (6.1)

where $A \in \mathbf{R}^{m \times n}$ and $b \in \mathbf{R}^m$ are problem data, $x \in \mathbf{R}^n$ is the variable, and $\|\cdot\|$ is
a norm on $\mathbf{R}^m$. A solution of the norm approximation problem is sometimes called
an approximate solution of $Ax \approx b$, in the norm $\|\cdot\|$. The vector

    \[ r = Ax - b \]

is called the residual for the problem; its components are sometimes called the
individual residuals associated with $x$.

The norm approximation problem (6.1) is a convex problem, and is solvable,
i.e., there is always at least one optimal solution. Its optimal value is zero if
and only if $b \in \mathcal{R}(A)$; the problem is more interesting and useful, however, when
$b \notin \mathcal{R}(A)$. We can assume without loss of generality that the columns of $A$ are
independent; in particular, that $m \ge n$. When $m = n$ the optimal point is simply
$A^{-1}b$, so we can assume that $m > n$.
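As a concrete illustration of the definitions above, the following minimal sketch evaluates the residual $r = Ax - b$ and three common norms of it for one candidate point. The data `A`, `b`, the point `x`, and the helper names are hypothetical and chosen only for illustration; the point is not claimed to be optimal for any of the norms.

```python
# Hypothetical data with m = 3 > n = 2 and independent columns.
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [1.0, 2.0, 4.0]


def residual(A, x, b):
    """Return the residual r = Ax - b as a list of m numbers."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i
            for row, b_i in zip(A, b)]


def norm_1(r):
    """Sum of absolute values of the individual residuals."""
    return sum(abs(r_i) for r_i in r)


def norm_2(r):
    """Euclidean norm of the residual."""
    return sum(r_i * r_i for r_i in r) ** 0.5


def norm_inf(r):
    """Largest absolute individual residual."""
    return max(abs(r_i) for r_i in r)


x = [1.0, 2.0]          # a candidate point (an assumption, not a solution)
r = residual(A, x, b)   # individual residuals associated with x
print(r, norm_1(r), norm_2(r), norm_inf(r))
```

Since $b$ is not in the range of this particular $A$, no choice of $x$ makes all three residuals zero, which is the interesting case described above.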

Approximation  interpretation 

By expressing $Ax$ as

    \[ Ax = x_1 a_1 + \cdots + x_n a_n, \]

where $a_1, \ldots, a_n \in \mathbf{R}^m$ are the columns of $A$, we see that the goal of the norm
approximation problem is to fit or approximate the vector $b$ by a linear combination
of the columns of $A$, as closely as possible, with deviation measured in the norm
$\|\cdot\|$.

The approximation problem is also called the regression problem. In this context
the vectors $a_1, \ldots, a_n$ are called the regressors, and the vector $x_1 a_1 + \cdots + x_n a_n$,
where $x$ is an optimal solution of the problem, is called the regression of $b$ (onto
the regressors).

Estimation  interpretation 

A closely related interpretation of the norm approximation problem arises in the
problem of estimating a parameter vector on the basis of an imperfect linear vector
measurement. We consider a linear measurement model

    \[ y = Ax + v, \]

where $y \in \mathbf{R}^m$ is a vector measurement, $x \in \mathbf{R}^n$ is a vector of parameters to be
estimated, and $v \in \mathbf{R}^m$ is some measurement error that is unknown, but presumed
to be small (in the norm $\|\cdot\|$). The estimation problem is to make a sensible guess
as to what $x$ is, given $y$.

If we guess that $x$ has the value $\hat{x}$, then we are implicitly making the guess that
$v$ has the value $y - A\hat{x}$. Assuming that smaller values of $v$ (measured by $\|\cdot\|$) are
more plausible than larger values, the most plausible guess for $x$ is

    \[ \hat{x} = \mathop{\rm argmin}_z \|Az - y\|. \]

(These ideas can be expressed more formally in a statistical framework; see chapter 7.)

Geometric  interpretation 

We consider the subspace $\mathcal{A} = \mathcal{R}(A) \subseteq \mathbf{R}^m$, and a point $b \in \mathbf{R}^m$. A projection of
the point $b$ onto the subspace $\mathcal{A}$, in the norm $\|\cdot\|$, is any point in $\mathcal{A}$ that is closest
to $b$, i.e., any optimal point for the problem

    \[ \begin{array}{ll}
    \mbox{minimize}   & \|u - b\| \\
    \mbox{subject to} & u \in \mathcal{A}.
    \end{array} \]

Parametrizing an arbitrary element of $\mathcal{R}(A)$ as $u = Ax$, we see that solving the
norm approximation problem (6.1) is equivalent to computing a projection of $b$
onto $\mathcal{A}$.

Design  interpretation 

We can interpret the norm approximation problem (6.1) as a problem of optimal
design. The $n$ variables $x_1, \ldots, x_n$ are design variables whose values are to be
determined. The vector $y = Ax$ gives a vector of $m$ results, which we assume to
be linear functions of the design variables $x$. The vector $b$ is a vector of target or
desired results. The goal is to choose a vector of design variables that achieves, as
closely as possible, the desired results, i.e., $Ax \approx b$. We can interpret the residual
vector $r$ as the deviation between the actual results (i.e., $Ax$) and the desired
or target results (i.e., $b$). If we measure the quality of a design by the norm of
the deviation between the actual results and the desired results, then the norm
approximation problem (6.1) is the problem of finding the best design.

Weighted norm approximation problems

An extension of the norm approximation problem is the weighted norm approximation
problem

    \[ \mbox{minimize} \quad \|W(Ax - b)\| \]

where the problem data $W \in \mathbf{R}^{m \times m}$ is called the weighting matrix. The weighting
matrix is often diagonal, in which case it gives different relative emphasis to
different components of the residual vector $r = Ax - b$.

The weighted norm problem can be considered as a norm approximation problem
with norm $\|\cdot\|$, and data $\tilde{A} = WA$, $\tilde{b} = Wb$, and therefore treated as a standard
norm approximation problem (6.1). Alternatively, the weighted norm approximation
problem can be considered a norm approximation problem with data $A$ and
$b$, and the $W$-weighted norm defined by

    \[ \|z\|_W = \|Wz\| \]

(assuming here that $W$ is nonsingular).

Least-squares  approximation 

The most common norm approximation problem involves the Euclidean or
$\ell_2$-norm. By squaring the objective, we obtain an equivalent problem which is called
the least-squares approximation problem,

    \[ \mbox{minimize} \quad \|Ax - b\|_2^2 = r_1^2 + r_2^2 + \cdots + r_m^2, \]

where the objective is the sum of squares of the residuals. This problem can be
solved analytically by expressing the objective as the convex quadratic function

    \[ f(x) = x^T A^T A x - 2 b^T A x + b^T b. \]

A point $x$ minimizes $f$ if and only if

    \[ \nabla f(x) = 2 A^T A x - 2 A^T b = 0, \]

i.e., if and only if $x$ satisfies the so-called normal equations

    \[ A^T A x = A^T b, \]

which always have a solution. Since we assume the columns of $A$ are independent,
the least-squares approximation problem has the unique solution $x = (A^T A)^{-1} A^T b$.
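The normal equations can be solved by hand when $n = 2$. The sketch below does this for a straight-line fit, where row $i$ of $A$ is $(1, t_i)$ and the right-hand side is a data vector `y`; the function name and the sample data are hypothetical. In practice a library least-squares routine would normally be preferred over forming $A^T A$ explicitly.

```python
def least_squares_line(t, y):
    """Fit y ~ alpha + beta * t by solving the 2x2 normal equations
    A^T A x = A^T y, where row i of A is (1, t_i) and x = (alpha, beta)."""
    m = len(t)
    s_t = sum(t)                                  # A^T A = [[m,   s_t ],
    s_tt = sum(ti * ti for ti in t)               #          [s_t, s_tt]]
    s_y = sum(y)                                  # A^T y = [s_y, s_ty]
    s_ty = sum(ti * yi for ti, yi in zip(t, y))
    det = m * s_tt - s_t * s_t   # nonzero when the columns of A are independent
    alpha = (s_tt * s_y - s_t * s_ty) / det
    beta = (m * s_ty - s_t * s_y) / det
    return alpha, beta


# Hypothetical data, chosen only to exercise the formula.
t = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 8.0]
alpha, beta = least_squares_line(t, y)

# At the solution the residual is orthogonal to both columns of A,
# which is exactly the statement A^T (A x - y) = 0.
r = [alpha + beta * ti - yi for ti, yi in zip(t, y)]
print(alpha, beta, sum(r), sum(ti * ri for ti, ri in zip(t, r)))
```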

Chebyshev  or  minimax  approximation 

When the $\ell_\infty$-norm is used, the norm approximation problem

    \[ \mbox{minimize} \quad \|Ax - b\|_\infty = \max\{|r_1|, \ldots, |r_m|\} \]

is called the Chebyshev approximation problem, or minimax approximation problem,
since we are to minimize the maximum (absolute value) residual. The Chebyshev
approximation problem can be cast as an LP

    \[ \begin{array}{ll}
    \mbox{minimize}   & t \\
    \mbox{subject to} & -t\mathbf{1} \preceq Ax - b \preceq t\mathbf{1},
    \end{array} \]

with variables $x \in \mathbf{R}^n$ and $t \in \mathbf{R}$.

Sum of absolute residuals approximation

When the $\ell_1$-norm is used, the norm approximation problem

    \[ \mbox{minimize} \quad \|Ax - b\|_1 = |r_1| + \cdots + |r_m| \]

is called the sum of (absolute) residuals approximation problem, or, in the context
of estimation, a robust estimator (for reasons that will be clear soon). Like the
Chebyshev approximation problem, the $\ell_1$-norm approximation problem can be
cast as an LP

    \[ \begin{array}{ll}
    \mbox{minimize}   & \mathbf{1}^T t \\
    \mbox{subject to} & -t \preceq Ax - b \preceq t,
    \end{array} \]

with variables $x \in \mathbf{R}^n$ and $t \in \mathbf{R}^m$.
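A small special case makes the difference between the three norms tangible. When $A$ is a single column of ones (so $n = 1$ and we fit one constant $x$ to the entries of $b$), the three problems have well-known closed-form solutions: the mean for least-squares, a median for the sum of absolute residuals, and the midrange for Chebyshev approximation. The sketch below uses hypothetical data and helper names and simply evaluates all three objectives at all three points; it does not solve the LPs above.

```python
from statistics import mean, median

# Hypothetical measurements; the last one is much larger than the rest.
b = [0.0, 1.0, 2.0, 3.0, 10.0]


def residuals(x):
    """Residual r = Ax - b when A is a single column of ones, so r_i = x - b_i."""
    return [x - b_i for b_i in b]


def sum_squares(x):
    return sum(r * r for r in residuals(x))


def sum_abs(x):
    return sum(abs(r) for r in residuals(x))


def max_abs(x):
    return max(abs(r) for r in residuals(x))


x_l2 = mean(b)                     # least-squares fit: the mean
x_l1 = median(b)                   # sum of absolute residuals fit: a median
x_cheb = (min(b) + max(b)) / 2.0   # Chebyshev fit: the midrange

for name, x in [("l2", x_l2), ("l1", x_l1), ("chebyshev", x_cheb)]:
    print(name, x, sum_squares(x), sum_abs(x), max_abs(x))
```

Each point is best for its own objective and worse for the other two. The single large entry pulls the Chebyshev fit the most and the $\ell_1$ fit the least, and the $\ell_1$ fit makes one residual exactly zero.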


6.1.2  Penalty  function  approximation 

In $\ell_p$-norm approximation, for $1 \le p < \infty$, the objective is

    \[ \left( |r_1|^p + \cdots + |r_m|^p \right)^{1/p}. \]

As in least-squares problems, we can consider the equivalent problem with objective

    \[ |r_1|^p + \cdots + |r_m|^p, \]

which is a separable and symmetric function of the residuals. In particular, the
objective depends only on the amplitude distribution of the residuals, i.e., the
residuals in sorted order.

We will consider a useful generalization of the $\ell_p$-norm approximation problem,
in which the objective depends only on the amplitude distribution of the residuals.
The penalty function approximation problem has the form

    \[ \begin{array}{ll}
    \mbox{minimize}   & \phi(r_1) + \cdots + \phi(r_m) \\
    \mbox{subject to} & r = Ax - b,
    \end{array} \]    (6.2)

where $\phi : \mathbf{R} \to \mathbf{R}$ is called the (residual) penalty function. We assume that $\phi$ is
convex, so the penalty function approximation problem is a convex optimization
problem. In many cases, the penalty function $\phi$ is symmetric, nonnegative, and
satisfies $\phi(0) = 0$, but we will not use these properties in our analysis.

Interpretation 

We can interpret the penalty function approximation problem (6.2) as follows. For
the choice $x$, we obtain the approximation $Ax$ of $b$, which has the associated residual
vector $r$. A penalty function assesses a cost or penalty for each component
of residual, given by $\phi(r_i)$; the total penalty is the sum of the penalties for each
residual, i.e., $\phi(r_1) + \cdots + \phi(r_m)$. Different choices of $x$ lead to different resulting
residuals, and therefore, different total penalties. In the penalty function approximation
problem, we minimize the total penalty incurred by the residuals.

Figure 6.1 Some common penalty functions: the quadratic penalty function
$\phi(u) = u^2$, the deadzone-linear penalty function with deadzone width $a =
1/4$, and the log barrier penalty function with limit $a = 1$.


Example 6.1 Some common penalty functions and associated approximation problems.

•  By taking $\phi(u) = |u|^p$, where $p \ge 1$, the penalty function approximation problem
is equivalent to the $\ell_p$-norm approximation problem. In particular, the
quadratic penalty function $\phi(u) = u^2$ yields least-squares or Euclidean norm
approximation, and the absolute value penalty function $\phi(u) = |u|$ yields
$\ell_1$-norm approximation.

•  The deadzone-linear penalty function (with deadzone width $a > 0$) is given by

    \[ \phi(u) = \left\{ \begin{array}{ll}
    0       & |u| \le a \\
    |u| - a & |u| > a.
    \end{array} \right. \]

The deadzone-linear function assesses no penalty for residuals smaller than $a$.

•  The log barrier penalty function (with limit $a > 0$) has the form

    \[ \phi(u) = \left\{ \begin{array}{ll}
    -a^2 \log(1 - (u/a)^2) & |u| < a \\
    \infty                 & |u| \ge a.
    \end{array} \right. \]

The log barrier penalty function assesses an infinite penalty for residuals larger
than $a$.

A deadzone-linear, log barrier, and quadratic penalty function are plotted in figure 6.1.
Note that the log barrier function is very close to the quadratic penalty for
$|u/a| \le 0.25$ (see exercise 6.1).
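The three penalty functions of this example are short enough to write down directly. The following is a minimal sketch; the function names, the default parameter values (taken from the caption of figure 6.1), and the helper `total_penalty` are choices made here for illustration.

```python
import math


def quadratic(u):
    """Quadratic penalty phi(u) = u^2."""
    return u * u


def deadzone_linear(u, a=0.25):
    """Deadzone-linear penalty: zero for |u| <= a, then |u| - a."""
    return max(abs(u) - a, 0.0)


def log_barrier(u, a=1.0):
    """Log barrier penalty: -a^2 log(1 - (u/a)^2) for |u| < a, infinite otherwise."""
    if abs(u) >= a:
        return math.inf
    return -a * a * math.log(1.0 - (u / a) ** 2)


def total_penalty(phi, residuals):
    """Objective of the penalty function approximation problem for given residuals."""
    return sum(phi(r_i) for r_i in residuals)


# Compare the log barrier with the quadratic penalty at the edge of |u/a| <= 0.25.
u = 0.25
print(quadratic(u), log_barrier(u), deadzone_linear(u))
print(total_penalty(quadratic, [1.0, -2.0]))
```

Evaluating `log_barrier` and `quadratic` at a few points with $|u/a| \le 0.25$ gives a quick numerical feel for the closeness noted at the end of the example.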


Scaling the penalty function by a positive number does not affect the solution of
the penalty function approximation problem, since this merely scales the objective
function. But the shape of the penalty function has a large effect on the solution of
the penalty function approximation problem. Roughly speaking, $\phi(u)$ is a measure
of our dislike of a residual of value $u$. If $\phi$ is very small (or even zero) for small
values of $u$, it means we care very little (or not at all) if residuals have these values.
If $\phi(u)$ grows rapidly as $u$ becomes large, it means we have a strong dislike for
large residuals; if $\phi$ becomes infinite outside some interval, it means that residuals
outside the interval are unacceptable. This simple interpretation gives insight into
the solution of a penalty function approximation problem, as well as guidelines for
choosing a penalty function.

As an example, let us compare $\ell_1$-norm and $\ell_2$-norm approximation, associated
with the penalty functions $\phi_1(u) = |u|$ and $\phi_2(u) = u^2$, respectively. For
$|u| = 1$, the two penalty functions assign the same penalty. For small $u$ we have
$\phi_1(u) \gg \phi_2(u)$, so $\ell_1$-norm approximation puts relatively larger emphasis on small
residuals compared to $\ell_2$-norm approximation. For large $u$ we have $\phi_2(u) \gg \phi_1(u)$,
so $\ell_1$-norm approximation puts less weight on large residuals, compared to $\ell_2$-norm
approximation. This difference in relative weightings for small and large residuals
is reflected in the solutions of the associated approximation problems. The amplitude
distribution of the optimal residual for the $\ell_1$-norm approximation problem
will tend to have more zero and very small residuals, compared to the $\ell_2$-norm
approximation solution. In contrast, the $\ell_2$-norm solution will tend to have relatively
fewer large residuals (since large residuals incur a much larger penalty in $\ell_2$-norm
approximation than in $\ell_1$-norm approximation).

Example 

An example will illustrate these ideas. We take a matrix $A \in \mathbf{R}^{100 \times 30}$ and vector
$b \in \mathbf{R}^{100}$ (chosen at random, but the results are typical), and compute the $\ell_1$-norm
and $\ell_2$-norm approximate solutions of $Ax \approx b$, as well as the penalty function
approximations with a deadzone-linear penalty (with $a = 0.5$) and log barrier
penalty (with $a = 1$). Figure 6.2 shows the four associated penalty functions,
and the amplitude distributions of the optimal residuals for these four penalty
approximations. From the plots of the penalty functions we note that

•  The $\ell_1$-norm penalty puts the most weight on small residuals and the least
weight on large residuals.

•  The $\ell_2$-norm penalty puts very small weight on small residuals, but strong
weight on large residuals.

•  The deadzone-linear penalty function puts no weight on residuals smaller
than 0.5, and relatively little weight on large residuals.

•  The log barrier penalty puts weight very much like the $\ell_2$-norm penalty for
small residuals, but puts very strong weight on residuals larger than around
0.8, and infinite weight on residuals larger than 1.

Several features are clear from the amplitude distributions:

•  For the $\ell_1$-optimal solution, many residuals are either zero or very small. The
$\ell_1$-optimal solution also has relatively more large residuals.

•  The $\ell_2$-norm approximation has many modest residuals, and relatively few
larger ones.

•  For the deadzone-linear penalty, we see that many residuals have the value
$\pm 0.5$, right at the edge of the ‘free’ zone, for which no penalty is assessed.

•  For the log barrier penalty, we see that no residuals have a magnitude larger
than 1, but otherwise the residual distribution is similar to the residual
distribution for $\ell_2$-norm approximation.

Figure 6.2 Histogram of residual amplitudes for four penalty functions, with
the (scaled) penalty functions also shown for reference. For the log barrier
plot, the quadratic penalty is also shown, in dashed curve.

Figure 6.3 A (nonconvex) penalty function that assesses a fixed penalty to
residuals larger than a threshold (which in this example is one): $\phi(u) = u^2$
if $|u| \le 1$ and $\phi(u) = 1$ if $|u| > 1$. As a result, penalty approximation with
this function would be relatively insensitive to outliers.


Sensitivity  to  outliers  or  large  errors 


In the estimation or regression context, an outlier is a measurement $y_i = a_i^T x + v_i$
for which the noise $v_i$ is relatively large. This is often associated with faulty data
or a flawed measurement. When outliers occur, any estimate of $x$ will be associated
with a residual vector with some large components. Ideally we would like to guess
which measurements are outliers, and either remove them from the estimation
process or greatly lower their weight in forming the estimate. (We cannot, however,
assign zero penalty for very large residuals, because then the optimal point would
likely make all residuals large, which yields a total penalty of zero.) This could be
accomplished using penalty function approximation, with a penalty function such
as

    \[ \phi(u) = \left\{ \begin{array}{ll}
    u^2 & |u| \le M \\
    M^2 & |u| > M,
    \end{array} \right. \]    (6.3)

shown in figure 6.3. This penalty function agrees with least-squares for any residual
smaller than $M$, but puts a fixed weight on any residual larger than $M$, no matter
how much larger it is. In other words, residuals larger than $M$ are ignored; they
are assumed to be associated with outliers or bad data. Unfortunately, the penalty
function (6.3) is not convex, and the associated penalty function approximation
problem becomes a hard combinatorial optimization problem.

Figure 6.4 The solid line is the robust least-squares or Huber penalty function
$\phi_{\mathrm{hub}}$, with $M = 1$. For $|u| \le M$ it is quadratic, and for $|u| > M$ it
grows linearly.

The sensitivity of a penalty function based estimation method to outliers depends
on the (relative) value of the penalty function for large residuals. If we
restrict ourselves to convex penalty functions (which result in convex optimization
problems), the ones that are least sensitive are those for which $\phi(u)$ grows linearly,
i.e., like $|u|$, for large $u$. Penalty functions with this property are sometimes called
robust, since the associated penalty function approximation methods are much less
sensitive to outliers or large errors than, for example, least-squares.

One obvious example of a robust penalty function is $\phi(u) = |u|$, corresponding
to $\ell_1$-norm approximation. Another example is the robust least-squares or Huber
penalty function, given by

    \[ \phi_{\mathrm{hub}}(u) = \left\{ \begin{array}{ll}
    u^2          & |u| \le M \\
    M(2|u| - M)  & |u| > M,
    \end{array} \right. \]    (6.4)

shown in figure 6.4. This penalty function agrees with the least-squares penalty
function for residuals smaller than $M$, and then reverts to $\ell_1$-like linear growth for
larger residuals. The Huber penalty function can be considered a convex approximation
of the outlier penalty function (6.3), in the following sense: They agree
for $|u| \le M$, and for $|u| > M$, the Huber penalty function is the convex function
closest to the outlier penalty function (6.3).
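The relation between the two penalties (6.3) and (6.4) can be checked numerically. The sketch below implements both with $M = 1$ as the default and tests midpoint convexity on one pair of points; the function names and the test points are hypothetical choices made here, and a single pair of points can only exhibit nonconvexity, not establish convexity.

```python
def outlier_penalty(u, M=1.0):
    """Nonconvex penalty (6.3): quadratic up to M, then the constant M^2."""
    return u * u if abs(u) <= M else M * M


def huber(u, M=1.0):
    """Huber penalty (6.4): quadratic up to M, then linear growth M(2|u| - M)."""
    return u * u if abs(u) <= M else M * (2.0 * abs(u) - M)


def midpoint_convex(phi, u1, u2):
    """Check phi((u1 + u2)/2) <= (phi(u1) + phi(u2))/2 for one pair of points."""
    return phi(0.5 * (u1 + u2)) <= 0.5 * (phi(u1) + phi(u2))


# The two penalties agree for |u| <= M and differ beyond it.
for u in [0.0, 0.5, 1.0, 2.0, 3.0]:
    print(u, outlier_penalty(u), huber(u))

# The pair (0, 4) violates midpoint convexity for (6.3) but not for (6.4).
print(midpoint_convex(outlier_penalty, 0.0, 4.0), midpoint_convex(huber, 0.0, 4.0))
```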


Example 6.2 Robust regression. Figure 6.5 shows 42 points $(t_i, y_i)$ in a plane, with
two obvious outliers (one at the upper left, and one at lower right). The dashed line
shows the least-squares approximation of the points by a straight line $f(t) = \alpha + \beta t$.
The coefficients $\alpha$ and $\beta$ are obtained by solving the least-squares problem

    \[ \mbox{minimize} \quad \sum_{i=1}^{42} (y_i - \alpha - \beta t_i)^2, \]

with variables $\alpha$ and $\beta$. The least-squares approximation is clearly rotated away from
the main locus of the points, toward the two outliers.

The solid line shows the robust least-squares approximation, obtained by minimizing
the Huber penalty function

    \[ \mbox{minimize} \quad \sum_{i=1}^{42} \phi_{\mathrm{hub}}(y_i - \alpha - \beta t_i), \]

with $M = 1$. This approximation is far less affected by the outliers.

Figure 6.5 The 42 circles show points that can be well approximated by
an affine function, except for the two outliers at upper left and lower right.
The dashed line is the least-squares fit of a straight line $f(t) = \alpha + \beta t$
to the points, and is rotated away from the main locus of points, toward
the outliers. The solid line shows the robust least-squares fit, obtained by
minimizing Huber’s penalty function with $M = 1$. This gives a far better fit
to the non-outlier data.
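The effect described in this example can be reproduced on a small synthetic data set. The sketch below is not the data of figure 6.5: it uses ten hypothetical points on the line $y = 1 + 2t$, with the first and last replaced by outliers at upper left and lower right. The text does not say how the Huber problem is solved; here we assume iteratively reweighted least squares, a standard heuristic for this penalty, in which each pass solves a weighted least-squares problem with weight 1 for residuals with $|r_i| \le M$ and $M/|r_i|$ otherwise.

```python
def weighted_line_fit(t, y, w):
    """Minimize sum_i w_i (y_i - alpha - beta t_i)^2 via the 2x2 normal equations."""
    s_w = sum(w)
    s_t = sum(wi * ti for wi, ti in zip(w, t))
    s_tt = sum(wi * ti * ti for wi, ti in zip(w, t))
    s_y = sum(wi * yi for wi, yi in zip(w, y))
    s_ty = sum(wi * ti * yi for wi, ti, yi in zip(w, t, y))
    det = s_w * s_tt - s_t * s_t
    alpha = (s_tt * s_y - s_t * s_ty) / det
    beta = (s_w * s_ty - s_t * s_y) / det
    return alpha, beta


def huber_line_fit(t, y, M=1.0, iterations=100):
    """Approximately minimize sum_i phi_hub(y_i - alpha - beta t_i) by
    iteratively reweighted least squares (an assumed solution method)."""
    w = [1.0] * len(t)   # the first pass is the ordinary least-squares fit
    for _ in range(iterations):
        alpha, beta = weighted_line_fit(t, y, w)
        r = [yi - alpha - beta * ti for ti, yi in zip(t, y)]
        # Weight 1 in the quadratic region, M/|r| in the linear region.
        w = [1.0 if abs(ri) <= M else M / abs(ri) for ri in r]
    return alpha, beta


# Hypothetical data: points on y = 1 + 2t, with two outliers.
t = [float(i) for i in range(10)]
y = [1.0 + 2.0 * ti for ti in t]
y[0] += 20.0    # outlier at upper left
y[-1] -= 20.0   # outlier at lower right

alpha_ls, beta_ls = weighted_line_fit(t, y, [1.0] * len(t))
alpha_hub, beta_hub = huber_line_fit(t, y)
print("least-squares:", alpha_ls, beta_ls)
print("Huber (M = 1):", alpha_hub, beta_hub)
```

On this data the least-squares slope is pulled far from the slope 2 of the underlying line, while the Huber slope stays much closer to it, which mirrors the dashed and solid lines of figure 6.5.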


Since $\ell_1$-norm approximation is among the (convex) penalty function approximation
methods that are most robust to outliers, $\ell_1$-norm approximation is sometimes
called robust estimation or robust regression. The robustness property of
$\ell_1$-norm estimation can also be understood in a statistical framework; see page 353.

Small residuals and $\ell_1$-norm approximation

We  can  also  focus  on  small  residuals.  Least-squares  approximation  puts  very  small 
weight on small residuals, since $\phi(u) = u^2$ is very small when $u$ is small. Penalty
functions  such  as  the  deadzone-linear  penalty  function  put  zero  weight  on  small 
residuals.  For  penalty  functions  that  are  very  small  for  small  residuals,  we  expect 
the  optimal  residuals  to  be  small,  but  not  very  small.  Roughly  speaking,  there  is 
little  or  no  incentive  to  drive  small  residuals  smaller. 

In contrast, penalty functions that put relatively large weight on small residuals,
such as $\phi(u) = |u|$, corresponding to $\ell_1$-norm approximation, tend to produce
optimal residuals many of which are very small, or even exactly zero. This means
that in $\ell_1$-norm approximation, we typically find that many of the equations are
satisfied exactly, i.e., we have $a_i^T x = b_i$ for many $i$. This phenomenon can be seen
in figure 6.2.
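
The simplest instance of this effect has a single variable: approximating the entries of $b$ by one scalar $x$ (so $A$ is a column of ones). The $\ell_2$ solution is then the mean and an $\ell_1$ solution is the median. The sketch below, with made-up data, counts how many residuals are exactly zero in each case.

```python
import statistics

b = [0.3, 1.9, 2.0, 2.4, 9.0]  # made-up data with one large value

x_l2 = statistics.fmean(b)   # minimizes sum_i (x - b_i)^2
x_l1 = statistics.median(b)  # minimizes sum_i |x - b_i|

res_l2 = [x_l2 - bi for bi in b]
res_l1 = [x_l1 - bi for bi in b]

zeros_l2 = sum(r == 0 for r in res_l2)
zeros_l1 = sum(r == 0 for r in res_l1)
print("l2 fit:", x_l2, "exactly zero residuals:", zeros_l2)
print("l1 fit:", x_l1, "exactly zero residuals:", zeros_l1)
```

With one variable, the $\ell_1$ solution satisfies one equation exactly; the large entry also moves the mean far more than the median.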


6.1.3  Approximation  with  constraints 

It  is  possible  to  add  constraints  to  the  basic  norm  approximation  problem  (6.1). 
When  these  constraints  are  convex,  the  resulting  problem  is  convex.  Constraints 
arise  for  a  variety  of  reasons. 

• In an approximation problem, constraints can be used to rule out certain unacceptable
approximations of the vector $b$, or to ensure that the approximator
$Ax$ satisfies certain properties.

•  In  an  estimation  problem,  the  constraints  arise  as  prior  knowledge  of  the 
vector  x  to  be  estimated,  or  from  prior  knowledge  of  the  estimation  error  v. 

•  Constraints  arise  in  a  geometric  setting  in  determining  the  projection  of  a 
point  b  on  a  set  more  complicated  than  a  subspace,  for  example,  a  cone  or 
polyhedron. 

Some  examples  will  make  these  clear. 

Nonnegativity  constraints  on  variables 

We can add the constraint $x \succeq 0$ to the basic norm approximation problem:

\[
\begin{array}{ll}
\text{minimize} & \|Ax - b\| \\
\text{subject to} & x \succeq 0.
\end{array}
\]

In  an  estimation  setting,  nonnegativity  constraints  arise  when  we  estimate  a  vector 
x  of  parameters  known  to  be  nonnegative,  e.g.,  powers,  intensities,  or  rates.  The 
geometric  interpretation  is  that  we  are  determining  the  projection  of  a  vector  b  onto 
the  cone  generated  by  the  columns  of  A.  We  can  also  interpret  this  problem  as 
approximating  b  using  a  nonnegative  linear  (i.e.,  conic)  combination  of  the  columns 
of  A. 

Variable  bounds 

Here we add the constraint $l \preceq x \preceq u$, where $l, u \in \mathbf{R}^n$ are problem parameters:

\[
\begin{array}{ll}
\text{minimize} & \|Ax - b\| \\
\text{subject to} & l \preceq x \preceq u.
\end{array}
\]

In  an  estimation  setting,  variable  bounds  arise  as  prior  knowledge  of  intervals  in 
which  each  variable  lies.  The  geometric  interpretation  is  that  we  are  determining 
the  projection  of  a  vector  b  onto  the  image  of  a  box  under  the  linear  mapping 
induced  by  A. 
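
For the Euclidean norm, a bound-constrained approximation problem of this kind can be solved approximately by a projected gradient iteration, since projecting onto a box is just componentwise clipping. The sketch below is one simple approach among many, with random data and names of our own choosing; setting the lower bound to zero and the upper bound to infinity gives the nonnegativity-constrained problem.

```python
import numpy as np


def box_constrained_lstsq(A, b, lo, hi, iters=2000):
    """Minimize ||Ax - b||_2 subject to lo <= x <= hi (projected gradient)."""
    # Step size 1/L, where L = ||A||_2^2 bounds the curvature of 0.5*||Ax-b||^2.
    L = np.linalg.norm(A, 2) ** 2
    x = np.clip(np.zeros(A.shape[1]), lo, hi)
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = np.clip(x - grad / L, lo, hi)  # gradient step, then project
    return x


rng = np.random.default_rng(1)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
lo, hi = np.zeros(5), 0.1 * np.ones(5)

x_box = box_constrained_lstsq(A, b, lo, hi)

# Naive alternative for comparison: clip the unconstrained solution.
x_clip = np.clip(np.linalg.lstsq(A, b, rcond=None)[0], lo, hi)
print(np.linalg.norm(A @ x_box - b), np.linalg.norm(A @ x_clip - b))
```

Clipping the unconstrained least-squares solution gives a feasible point, but in general not the projection of $b$ onto the image of the box; the iteration above does at least as well.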

Probability distribution

We can impose the constraint that $x$ satisfy $x \succeq 0$, $\mathbf{1}^T x = 1$:

\[
\begin{array}{ll}
\text{minimize} & \|Ax - b\| \\
\text{subject to} & x \succeq 0, \quad \mathbf{1}^T x = 1.
\end{array}
\]

This  would  arise  in  the  estimation  of  proportions  or  relative  frequencies,  which  are 
nonnegative  and  sum  to  one.  It  can  also  be  interpreted  as  approximating  b  by  a 
convex  combination  of  the  columns  of  A.  (We  will  have  much  more  to  say  about 
estimating  probabilities  in  §7.2.) 

Norm  ball  constraint 

We  can  add  to  the  basic  norm  approximation  problem  the  constraint  that  x  lie  in 
a  norm  ball: 

\[
\begin{array}{ll}
\text{minimize} & \|Ax - b\| \\
\text{subject to} & \|x - x_0\| \leq d,
\end{array}
\]

where $x_0$ and $d$ are problem parameters. Such a constraint can be added for several
reasons. 

• In an estimation setting, $x_0$ is a prior guess of what the parameter $x$ is, and $d$
is the maximum plausible deviation of our estimate from our prior guess. Our
estimate of the parameter $x$ is the value $\hat{x}$ which best matches the measured
data (i.e., minimizes $\|Az - b\|$) among all plausible candidates (i.e., $z$ that
satisfy $\|z - x_0\| \leq d$).

• The constraint $\|x - x_0\| \leq d$ can denote a trust region. Here the linear relation
$y = Ax$ is only an approximation of some nonlinear relation $y = f(x)$ that is
valid when $x$ is near some point $x_0$, specifically $\|x - x_0\| \leq d$. The problem
is to minimize $\|Ax - b\|$ but only over those $x$ for which the model $y = Ax$ is
trusted.

These  ideas  also  come  up  in  the  context  of  regularization;  see  §6.3.2. 


6.2  Least-norm  problems 

The  basic  least-norm  problem  has  the  form 

\[
\begin{array}{ll}
\text{minimize} & \|x\| \\
\text{subject to} & Ax = b,
\end{array}
\qquad (6.5)
\]

where the data are $A \in \mathbf{R}^{m \times n}$ and $b \in \mathbf{R}^m$, the variable is $x \in \mathbf{R}^n$, and $\|\cdot\|$ is a
norm on $\mathbf{R}^n$. A solution of the problem, which always exists if the linear equations
$Ax = b$ have a solution, is called a least-norm solution of $Ax = b$. The least-norm
problem is, of course, a convex optimization problem.

We can assume without loss of generality that the rows of $A$ are independent, so
$m \leq n$. When $m = n$, the only feasible point is $x = A^{-1} b$; the least-norm problem
is interesting only when $m < n$, i.e., when the equation $Ax = b$ is underdetermined.

Reformulation as norm approximation problem

The  least-norm  problem  (6.5)  can  be  formulated  as  a  norm  approximation  problem 
by eliminating the equality constraint. Let $x_0$ be any solution of $Ax = b$, and let
$Z \in \mathbf{R}^{n \times k}$ be a matrix whose columns are a basis for the nullspace of $A$. The
general solution of $Ax = b$ can then be expressed as $x_0 + Zu$ where $u \in \mathbf{R}^k$. The
least-norm problem (6.5) can be expressed as

\[ \text{minimize} \quad \|x_0 + Zu\|, \]

with variable $u \in \mathbf{R}^k$, which is a norm approximation problem. In particular,
our  analysis  and  discussion  of  norm  approximation  problems  applies  to  least-norm 
problems  as  well  (when  interpreted  correctly). 
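
The elimination step can be checked numerically for the Euclidean norm. In the sketch below (random data; all names are ours), a nullspace basis $Z$ is taken from the singular value decomposition of $A$, a deliberately non-minimal particular solution $x_0$ is formed, and the resulting norm approximation problem in $u$ is solved by least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 6
A = rng.standard_normal((m, n))  # full row rank with probability one
b = rng.standard_normal(m)

# Nullspace basis: the last n - m right singular vectors (rank m assumed).
_, _, Vt = np.linalg.svd(A)
Z = Vt[m:].T

# A particular solution that is deliberately not the least-norm one.
x0 = np.linalg.lstsq(A, b, rcond=None)[0] + Z @ rng.standard_normal(n - m)

# Norm approximation problem: minimize ||x0 + Z u||_2 over u,
# i.e. approximate -x0 by a combination of the columns of Z.
u = np.linalg.lstsq(Z, -x0, rcond=None)[0]
x = x0 + Z @ u

print("residual of Ax = b:", np.linalg.norm(A @ x - b))
print("norm before and after:", np.linalg.norm(x0), np.linalg.norm(x))
```

For other norms the same change of variables applies, but the problem in $u$ then needs a general norm approximation solver rather than least squares.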

Control  or  design  interpretation 

We  can  interpret  the  least-norm  problem  (6.5)  as  a  problem  of  optimal  design  or 
optimal control. The $n$ variables $x_1, \ldots, x_n$ are design variables whose values are
to be determined. In a control setting, the variables $x_1, \ldots, x_n$ represent inputs,
whose values we are to choose. The vector $y = Ax$ gives $m$ attributes or results of
the design $x$, which we assume to be linear functions of the design variables $x$. The
$m < n$ equations $Ax = b$ represent $m$ specifications or requirements on the design.
Since $m < n$, the design is underspecified; there are $n - m$ degrees of freedom in
the design (assuming $A$ is rank $m$).

Among  all  the  designs  that  satisfy  the  specifications,  the  least-norm  problem 
chooses  the  smallest  design,  as  measured  by  the  norm  ||  •  ||.  This  can  be  thought  of 
as  the  most  efficient  design,  in  the  sense  that  it  achieves  the  specifications  Ax  =  b, 
with  the  smallest  possible  x. 

Estimation  interpretation 

We assume that $x$ is a vector of parameters to be estimated. We have $m < n$
perfect (noise free) linear measurements, given by $Ax = b$. Since we have fewer
measurements than parameters to estimate, our measurements do not completely
determine $x$. Any parameter vector $x$ that satisfies $Ax = b$ is consistent with our
measurements.

To  make  a  good  guess  about  what  x  is,  without  taking  further  measurements, 
we  must  use  prior  information.  Suppose  our  prior  information,  or  assumption,  is 
that  x  is  more  likely  to  be  small  (as  measured  by  ||  •  ||)  than  large.  The  least-norm 
problem  chooses  as  our  estimate  of  the  parameter  vector  x  the  one  that  is  smallest 
(hence,  most  plausible)  among  all  parameter  vectors  that  are  consistent  with  the 
measurements  Ax  =  b.  (For  a  statistical  interpretation  of  the  least-norm  problem, 
see  page  359.) 

Geometric  interpretation 

We can also give a simple geometric interpretation of the least-norm problem (6.5).
The feasible set $\{x \mid Ax = b\}$ is affine, and the objective is the distance (measured
by the norm $\|\cdot\|$) between $x$ and the point 0. The least-norm problem finds the
point in the affine set with minimum distance to 0, i.e., it determines the projection
of the point 0 on the affine set $\{x \mid Ax = b\}$.

Least-squares  solution  of  linear  equations 

The most common least-norm problem involves the Euclidean or $\ell_2$-norm. By
squaring the objective we obtain the equivalent problem

\[
\begin{array}{ll}
\text{minimize} & \|x\|_2^2 \\
\text{subject to} & Ax = b,
\end{array}
\]

the unique solution of which is called the least-squares solution of the equations
$Ax = b$. Like the least-squares approximation problem, this problem can be solved
analytically. Introducing the dual variable $\nu \in \mathbf{R}^m$, the optimality conditions are

\[ 2x^\star + A^T \nu^\star = 0, \qquad Ax^\star = b, \]

which is a pair of linear equations, and readily solved. From the first equation
we obtain $x^\star = -(1/2) A^T \nu^\star$; substituting this into the second equation we obtain
$-(1/2) A A^T \nu^\star = b$, and conclude

\[ \nu^\star = -2 (AA^T)^{-1} b, \qquad x^\star = A^T (AA^T)^{-1} b. \]

(Since $\mathbf{rank}\, A = m < n$, the matrix $AA^T$ is invertible.)
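
The closed-form solution is easy to verify numerically. The following sketch (random data, names ours) computes $\nu^\star$ and $x^\star$ from the formulas above and checks both optimality conditions.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 9))  # m = 4 < n = 9, full row rank w.p. one
b = rng.standard_normal(4)

nu_star = -2.0 * np.linalg.solve(A @ A.T, b)  # nu* = -2 (A A^T)^{-1} b
x_star = -0.5 * A.T @ nu_star                 # x*  = A^T (A A^T)^{-1} b

print("stationarity:", np.linalg.norm(2 * x_star + A.T @ nu_star))
print("feasibility: ", np.linalg.norm(A @ x_star - b))
```

Solving the $m \times m$ system with `np.linalg.solve` avoids forming $(AA^T)^{-1}$ explicitly; the result agrees with the pseudo-inverse solution `np.linalg.pinv(A) @ b`.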

Least-penalty  problems 

A  useful  variation  on  the  least-norm  problem  (6.5)  is  the  least-penalty  problem 

\[
\begin{array}{ll}
\text{minimize} & \phi(x_1) + \cdots + \phi(x_n) \\
\text{subject to} & Ax = b,
\end{array}
\qquad (6.6)
\]

where $\phi : \mathbf{R} \to \mathbf{R}$ is convex, nonnegative, and satisfies $\phi(0) = 0$. The penalty
function value $\phi(u)$ quantifies our dislike of a component of $x$ having value $u$;
the  least-penalty  problem  then  finds  x  that  has  least  total  penalty,  subject  to  the 
constraint  Ax  =  b. 

All  of  the  discussion  and  interpretation  of  penalty  functions  in  penalty  function 
approximation  can  be  transposed  to  the  least-penalty  problem,  by  substituting 
the  amplitude  distribution  of  x  (in  the  least-penalty  problem)  for  the  amplitude 
distribution  of  the  residual  r  (in  the  penalty  approximation  problem). 

Sparse solutions via least $\ell_1$-norm

Recall from the discussion on page 300 that $\ell_1$-norm approximation gives relatively
large weight to small residuals, and therefore results in many optimal residuals
small, or even zero. A similar effect occurs in the least-norm context. The least
$\ell_1$-norm problem,

\[
\begin{array}{ll}
\text{minimize} & \|x\|_1 \\
\text{subject to} & Ax = b,
\end{array}
\]

tends to produce a solution $x$ with a large number of components equal to zero.
In other words, the least $\ell_1$-norm problem tends to produce sparse solutions of
$Ax = b$, often with $m$ nonzero components.

It is easy to find solutions of $Ax = b$ that have only $m$ nonzero components.
Choose any set of $m$ indices (out of $1, \ldots, n$) which are to be the nonzero components
of $x$. The equation $Ax = b$ reduces to $\tilde{A}\tilde{x} = b$, where $\tilde{A}$ is the $m \times m$
submatrix of $A$ obtained by selecting only the chosen columns, and $\tilde{x} \in \mathbf{R}^m$ is the
subvector of $x$ containing the $m$ selected components. If $\tilde{A}$ is nonsingular, then
we can take $\tilde{x} = \tilde{A}^{-1} b$, which gives a feasible solution $x$ with $m$ or fewer nonzero
components. If $\tilde{A}$ is singular and $b \notin \mathcal{R}(\tilde{A})$, the equation $\tilde{A}\tilde{x} = b$ is unsolvable,
which means there is no feasible $x$ with the chosen set of nonzero components. If
$\tilde{A}$ is singular and $b \in \mathcal{R}(\tilde{A})$, there is a feasible solution with fewer than $m$ nonzero
components.

This approach can be used to find the smallest $x$ with $m$ (or fewer) nonzero
entries, but in general requires examining and comparing all $n!/(m!(n-m)!)$ choices
of $m$ nonzero coefficients of the $n$ coefficients in $x$. Solving the least $\ell_1$-norm
problem, on the other hand, gives a good heuristic for finding a sparse, and small,
solution of $Ax = b$.
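
The enumeration just described can be written directly, which also makes its cost visible. This sketch (names and random data are ours) tries every choice of $m$ columns, skips singular submatrices, and keeps the feasible solution of smallest norm; it is only practical for very small $n$.

```python
import itertools
import math

import numpy as np


def smallest_m_sparse_solution(A, b, norm_ord=1):
    """Brute force: best solution of Ax = b supported on some m columns."""
    m, n = A.shape
    best, best_size = None, np.inf
    for cols in itertools.combinations(range(n), m):
        cols = list(cols)
        A_sub = A[:, cols]
        if np.linalg.matrix_rank(A_sub) < m:
            continue  # singular submatrix: skipped in this sketch
        x = np.zeros(n)
        x[cols] = np.linalg.solve(A_sub, b)
        size = np.linalg.norm(x, norm_ord)
        if size < best_size:
            best, best_size = x, size
    return best


rng = np.random.default_rng(4)
A = rng.standard_normal((2, 5))
b = rng.standard_normal(2)

x_sparse = smallest_m_sparse_solution(A, b)
n_patterns = math.comb(5, 2)  # n!/(m!(n-m)!) candidate supports
print(x_sparse, n_patterns)
```

The number of supports grows combinatorially; for instance `math.comb(20, 10)` is already 184756, which is why a convex heuristic is attractive.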


6.3  Regularized  approximation 

6.3.1  Bi-criterion  formulation 

In  the  basic  form  of  regularized  approximation,  the  goal  is  to  find  a  vector  x  that 
is  small  (if  possible),  and  also  makes  the  residual  Ax  —  b  small.  This  is  naturally 
described as a (convex) vector optimization problem with two objectives, $\|Ax - b\|$
and $\|x\|$:

\[ \text{minimize (w.r.t. } \mathbf{R}_+^2\text{)} \quad (\|Ax - b\|, \|x\|). \qquad (6.7) \]


The  two  norms  can  be  different:  the  first,  used  to  measure  the  size  of  the  residual, 
is on $\mathbf{R}^m$; the second, used to measure the size of $x$, is on $\mathbf{R}^n$.

The optimal trade-off between the two objectives can be found using several
methods. The optimal trade-off curve of $\|Ax - b\|$ versus $\|x\|$, which shows how
large one of the objectives must be made to have the other one small, can then be
plotted. One endpoint of the optimal trade-off curve between $\|Ax - b\|$ and $\|x\|$
is easy to describe. The minimum value of $\|x\|$ is zero, and is achieved only when
$x = 0$. For this value of $x$, the residual norm has the value $\|b\|$.

The other endpoint of the trade-off curve is more complicated to describe. Let
$C$ denote the set of minimizers of $\|Ax - b\|$ (with no constraint on $\|x\|$). Then any
minimum norm point in $C$ is Pareto optimal, corresponding to the other endpoint
of the trade-off curve. In other words, Pareto optimal points at this endpoint are
given by minimum norm minimizers of $\|Ax - b\|$. If both norms are Euclidean, this
Pareto optimal point is unique, and given by $x = A^\dagger b$, where $A^\dagger$ is the pseudo-inverse
of $A$. (See §4.7.6, page 184, and §A.5.4.)

6.3.2 Regularization

Regularization  is  a  common  scalarization  method  used  to  solve  the  bi-criterion 
problem  (6.7).  One  form  of  regularization  is  to  minimize  the  weighted  sum  of  the 
objectives: 

\[ \text{minimize} \quad \|Ax - b\| + \gamma \|x\|, \qquad (6.8) \]

where $\gamma > 0$ is a problem parameter. As $\gamma$ varies over $(0, \infty)$, the solution of (6.8)
traces out the optimal trade-off curve.

Another common method of regularization, especially when the Euclidean norm
is used, is to minimize the weighted sum of squared norms, i.e.,

\[ \text{minimize} \quad \|Ax - b\|^2 + \delta \|x\|^2, \qquad (6.9) \]

for a variety of values of $\delta > 0$.

These  regularized  approximation  problems  each  solve  the  bi-criterion  problem 
of making both $\|Ax - b\|$ and $\|x\|$ small, by adding an extra term or penalty
associated  with  the  norm  of  x. 

Interpretations 

Regularization  is  used  in  several  contexts.  In  an  estimation  setting,  the  extra  term 
penalizing  large  ||x||  can  be  interpreted  as  our  prior  knowledge  that  ||x||  is  not  too 
large.  In  an  optimal  design  setting,  the  extra  term  adds  the  cost  of  using  large 
values  of  the  design  variables  to  the  cost  of  missing  the  target  specifications. 

The constraint that $\|x\|$ be small can also reflect a modeling issue. It might be,
for example, that $y = Ax$ is only a good approximation of the true relationship
$y = f(x)$ between $x$ and $y$. In order to have $f(x) \approx b$, we want $Ax \approx b$, and also
need $x$ small in order to ensure that $f(x) \approx Ax$.

We  will  see  in  §6.4.1  and  §6.4.2  that  regularization  can  be  used  to  take  into 
account  variation  in  the  matrix  A.  Roughly  speaking,  a  large  x  is  one  for  which 
variation  in  A  causes  large  variation  in  Ax,  and  hence  should  be  avoided. 

Regularization  is  also  used  when  the  matrix  A  is  square,  and  the  goal  is  to 
solve  the  linear  equations  Ax  =  b.  In  cases  where  A  is  poorly  conditioned,  or  even 
singular,  regularization  gives  a  compromise  between  solving  the  equations  (i.e., 
making $\|Ax - b\|$ zero) and keeping $x$ of reasonable size.

Regularization  comes  up  in  a  statistical  setting;  see  §7.1.2. 

Tikhonov  regularization 

The  most  common  form  of  regularization  is  based  on  (6.9),  with  Euclidean  norms, 
which  results  in  a  (convex)  quadratic  optimization  problem: 

\[ \text{minimize} \quad \|Ax - b\|_2^2 + \delta \|x\|_2^2 = x^T (A^T A + \delta I) x - 2 b^T A x + b^T b. \qquad (6.10) \]

This Tikhonov regularization problem has the analytical solution

\[ x = (A^T A + \delta I)^{-1} A^T b. \]

Since $A^T A + \delta I \succ 0$ for any $\delta > 0$, the Tikhonov regularized least-squares solution
requires no rank (or dimension) assumptions on the matrix $A$.
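
To illustrate the last point, the sketch below applies the analytical solution to a rank-deficient matrix (a small example of our own) and checks it against the equivalent stacked least-squares problem with rows $\sqrt{\delta} I$ appended to $A$.

```python
import numpy as np


def tikhonov(A, b, delta):
    """Solve minimize ||Ax - b||_2^2 + delta ||x||_2^2 for delta > 0."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + delta * np.eye(n), A.T @ b)


# A has rank one, so A^T A alone is singular.
A = np.array([[1.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x_small = tikhonov(A, b, 0.1)
x_large = tikhonov(A, b, 10.0)

# Equivalent stacked form: || [A; sqrt(delta) I] x - [b; 0] ||_2^2.
A_aug = np.vstack([A, np.sqrt(0.1) * np.eye(2)])
b_aug = np.concatenate([b, np.zeros(2)])
x_aug = np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]
print(x_small, x_large, x_aug)
```

Increasing $\delta$ shrinks the solution, at the cost of a larger residual.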

Smoothing regularization

The idea of regularization, i.e., adding to the objective a term that penalizes large
$x$, can be extended in several ways. In one useful extension we add a regularization
term of the form $\|Dx\|$, in place of $\|x\|$. In many applications, the matrix $D$
represents an approximate differentiation or second-order differentiation operator,
so $\|Dx\|$ represents a measure of the variation or smoothness of $x$.

For example, suppose that the vector $x \in \mathbf{R}^n$ represents the value of some
continuous physical parameter, say, temperature, along the interval $[0,1]$: $x_i$ is
the temperature at the point $i/n$. A simple approximation of the gradient or
first derivative of the parameter near $i/n$ is given by $n(x_{i+1} - x_i)$, and a simple
approximation of its second derivative is given by the second difference

\[ n \left( n(x_{i+1} - x_i) - n(x_i - x_{i-1}) \right) = n^2 (x_{i+1} - 2x_i + x_{i-1}). \]

If $\Delta$ is the (tridiagonal, Toeplitz) matrix

\[
\Delta = n^2
\begin{bmatrix}
1 & -2 & 1 & 0 & \cdots & 0 & 0 & 0 & 0 \\
0 & 1 & -2 & 1 & \cdots & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -2 & \cdots & 0 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & -2 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & \cdots & 1 & -2 & 1 & 0 \\
0 & 0 & 0 & 0 & \cdots & 0 & 1 & -2 & 1
\end{bmatrix}
\in \mathbf{R}^{(n-2) \times n},
\]

then $\Delta x$ represents an approximation of the second derivative of the parameter, so
$\|\Delta x\|_2^2$ represents a measure of the mean-square curvature of the parameter over
the interval $[0,1]$.
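
A quick way to check the second-difference matrix is to apply it to samples of a function whose second derivative is known. In this sketch (function name and test signal are ours), the samples $x_i = (i/n)^2$ should give a second-derivative estimate of 2 in every row.

```python
import numpy as np


def second_difference(n):
    """(n-2) x n second-difference matrix with rows n^2 * [1, -2, 1]."""
    D2 = np.zeros((n - 2, n))
    for i in range(n - 2):
        D2[i, i:i + 3] = [1.0, -2.0, 1.0]
    return n ** 2 * D2


n = 50
grid = np.arange(1, n + 1) / n  # x_i is the value at the point i/n
x = grid ** 2                   # the second derivative of t^2 is 2

curvature = second_difference(n) @ x
print(curvature[:3])
```

For a quadratic the second difference is exact up to rounding; for a general smooth function it is only an approximation.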

The  Tikhonov  regularized  problem 

\[ \text{minimize} \quad \|Ax - b\|_2^2 + \delta \|\Delta x\|_2^2 \]

can be used to trade off the objective $\|Ax - b\|_2^2$, which might represent a measure
of fit, or consistency with experimental data, and the objective $\|\Delta x\|_2^2$, which is
(approximately) the mean-square curvature of the underlying physical parameter.
The parameter $\delta$ is used to control the amount of regularization required, or to
plot the optimal trade-off curve of fit versus smoothness.

We can also add several regularization terms. For example, we can add terms
associated with smoothness and size, as in

\[ \text{minimize} \quad \|Ax - b\|_2^2 + \delta \|\Delta x\|_2^2 + \eta \|x\|_2^2. \]

Here, the parameter $\delta \geq 0$ is used to control the smoothness of the approximate
solution, and the parameter $\eta \geq 0$ is used to control its size.

Example 6.3 Optimal input design. We consider a dynamical system with scalar
input sequence $u(0), u(1), \ldots, u(N)$, and scalar output sequence $y(0), y(1), \ldots, y(N)$,
related by convolution:

\[ y(t) = \sum_{\tau=0}^{t} h(\tau) u(t - \tau), \qquad t = 0, 1, \ldots, N. \]

The sequence $h(0), h(1), \ldots, h(N)$ is called the convolution kernel or impulse response
of the system.

Our  goal  is  to  choose  the  input  sequence  u  to  achieve  several  goals. 

• Output tracking. The primary goal is that the output $y$ should track, or follow,
a desired target or reference signal $y_{\mathrm{des}}$. We measure output tracking error by
the quadratic function

\[ J_{\mathrm{track}} = \frac{1}{N+1} \sum_{t=0}^{N} (y(t) - y_{\mathrm{des}}(t))^2. \]

• Small input. The input should not be large. We measure the magnitude of the
input by the quadratic function

\[ J_{\mathrm{mag}} = \frac{1}{N+1} \sum_{t=0}^{N} u(t)^2. \]

• Small input variations. The input should not vary rapidly. We measure the
magnitude of the input variations by the quadratic function

\[ J_{\mathrm{der}} = \frac{1}{N} \sum_{t=0}^{N-1} (u(t+1) - u(t))^2. \]

By minimizing a weighted sum

\[ J_{\mathrm{track}} + \delta J_{\mathrm{der}} + \eta J_{\mathrm{mag}}, \]

where $\delta > 0$ and $\eta > 0$, we can trade off the three objectives.

Now we consider a specific example, with $N = 200$, and impulse response

\[ h(t) = \frac{1}{9} (0.9)^t (1 - 0.4 \cos(2t)). \]

Figure 6.6 shows the optimal input, and corresponding output (along with the desired
trajectory $y_{\mathrm{des}}$), for three values of the regularization parameters $\delta$ and $\eta$. The top
row shows the optimal input and corresponding output for $\delta = 0$, $\eta = 0.005$. In this
case we have some regularization for the magnitude of the input, but no regularization
for its variation. While the tracking is good (i.e., $J_{\mathrm{track}}$ is small), the input
required is large, and rapidly varying. The second row corresponds to $\delta = 0$, $\eta = 0.05$.
In this case we have more magnitude regularization, but still no regularization for
variation in $u$. The corresponding input is indeed smaller, at the cost of a larger
tracking error. The bottom row shows the results for $\delta = 0.3$, $\eta = 0.05$. In this
case we have added some regularization for the variation. The input variation is
substantially reduced, with not much increase in output tracking error.
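
The weighted-sum problem of this example is a regularized least-squares problem in $u$, and can be set up in a few lines. The sketch below uses the impulse response of the example, but the target signal `y_des` is a hypothetical step-like signal of our own (the target is shown only in figure 6.6), so the numbers it produces are illustrative and do not reproduce the figure.

```python
import numpy as np

N = 200
t = np.arange(N + 1)
h = (1.0 / 9.0) * 0.9 ** t * (1.0 - 0.4 * np.cos(2.0 * t))

# Lower-triangular Toeplitz matrix: (H u)(t) = sum_{tau<=t} h(tau) u(t - tau).
H = np.zeros((N + 1, N + 1))
for k in range(N + 1):
    H[k, :k + 1] = h[k::-1]

D = np.diff(np.eye(N + 1), axis=0)    # (D u)(t) = u(t+1) - u(t)
y_des = np.where(t < 100, 1.0, -0.5)  # hypothetical target signal


def design_input(delta, eta):
    """Minimize J_track + delta*J_der + eta*J_mag as one stacked least squares."""
    blocks = np.vstack([
        H / np.sqrt(N + 1),
        np.sqrt(delta / N) * D,
        np.sqrt(eta / (N + 1)) * np.eye(N + 1),
    ])
    rhs = np.concatenate([y_des / np.sqrt(N + 1), np.zeros(N), np.zeros(N + 1)])
    return np.linalg.lstsq(blocks, rhs, rcond=None)[0]


def objectives(u):
    """Return (J_track, J_der, J_mag) for an input sequence u."""
    y = H @ u
    return np.mean((y - y_des) ** 2), np.mean(np.diff(u) ** 2), np.mean(u ** 2)


for delta, eta in [(0.0, 0.005), (0.0, 0.05), (0.3, 0.05)]:
    print(delta, eta, objectives(design_input(delta, eta)))
```

The three parameter pairs are those of the example; the printed triples show the same qualitative trade-off, with larger $\eta$ reducing the input magnitude and positive $\delta$ reducing the input variation.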

$\ell_1$-norm regularization

Regularization with an $\ell_1$-norm can be used as a heuristic for finding a sparse
solution. For example, consider the problem

\[ \text{minimize} \quad \|Ax - b\|_2 + \gamma \|x\|_1, \qquad (6.11) \]

Figure 6.6 Optimal inputs (left) and resulting outputs (right) for three values
of the regularization parameters $\delta$ (which corresponds to input variation) and
$\eta$ (which corresponds to input magnitude). The dashed line in the righthand
plots shows the desired output $y_{\mathrm{des}}$. Top row: $\delta = 0$, $\eta = 0.005$; middle row:
$\delta = 0$, $\eta = 0.05$; bottom row: $\delta = 0.3$, $\eta = 0.05$.

in which the residual is measured with the Euclidean norm and the regularization is
done with an $\ell_1$-norm. By varying the parameter $\gamma$ we can sweep out the optimal
trade-off curve between $\|Ax - b\|_2$ and $\|x\|_1$, which serves as an approximation
of the optimal trade-off curve between $\|Ax - b\|_2$ and the sparsity or cardinality
$\mathbf{card}(x)$ of the vector $x$, i.e., the number of nonzero elements. The problem (6.11)
can be recast and solved as an SOCP.


Example 6.4 Regressor selection problem. We are given a matrix $A \in \mathbf{R}^{m \times n}$,
whose columns are potential regressors, and a vector $b \in \mathbf{R}^m$ that is to be fit by a
linear combination of $k < n$ columns of $A$. The problem is to choose the subset of $k$
regressors to be used, and the associated coefficients. We can express this problem
as

\[
\begin{array}{ll}
\text{minimize} & \|Ax - b\|_2 \\
\text{subject to} & \mathbf{card}(x) \leq k.
\end{array}
\]

In  general,  this  is  a  hard  combinatorial  problem. 

One straightforward approach is to check every possible sparsity pattern in $x$ with $k$
nonzero entries. For a fixed sparsity pattern, we can find the optimal $x$ by solving
a least-squares problem, i.e., minimizing $\|\tilde{A}\tilde{x} - b\|_2$, where $\tilde{A}$ denotes the submatrix
of $A$ obtained by keeping the columns corresponding to the sparsity pattern, and
$\tilde{x}$ is the subvector with the nonzero components of $x$. This is done for each of the
$n!/(k!(n-k)!)$ sparsity patterns with $k$ nonzeros.

A good heuristic approach is to solve the problem (6.11) for different values of $\gamma$,
finding the smallest value of $\gamma$ that results in a solution with $\mathbf{card}(x) = k$. We then
fix this sparsity pattern and find the value of $x$ that minimizes $\|Ax - b\|_2$.

Figure 6.7 illustrates a numerical example with $A \in \mathbf{R}^{10 \times 20}$, $x \in \mathbf{R}^{20}$, $b \in \mathbf{R}^{10}$. The
circles on the dashed curve are the (globally) Pareto optimal values for the trade-off
between $\mathbf{card}(x)$ (vertical axis) and the residual $\|Ax - b\|_2$ (horizontal axis). For
each $k$, the Pareto optimal point was obtained by enumerating all possible sparsity
patterns with $k$ nonzero entries, as described above. The circles on the solid curve
were obtained with the heuristic approach, by using the sparsity patterns of the
solutions of problem (6.11) for different values of $\gamma$. Note that for $\mathbf{card}(x) = 1$, the
heuristic method actually finds the global optimum.

This  idea  will  come  up  again  in  basis  pursuit  (§6.5.4). 


6.3.3  Reconstruction,  smoothing,  and  de-noising 

In this section we describe an important special case of the bi-criterion approximation
problem described above, and give some examples showing how different
regularization methods perform. In reconstruction problems, we start with a signal
represented by a vector $x \in \mathbf{R}^n$. The coefficients $x_i$ correspond to the value of
some function of time, evaluated (or sampled, in the language of signal processing)
at evenly spaced points. It is usually assumed that the signal does not vary too
rapidly, which means that usually, we have $x_i \approx x_{i+1}$. (In this section we consider
signals in one dimension, e.g., audio signals, but the same ideas can be applied to
signals in two or more dimensions, e.g., images or video.)

Figure 6.7 Sparse regressor selection with a matrix $A \in \mathbf{R}^{10 \times 20}$. The circles
on the dashed line are the Pareto optimal values for the trade-off between
the residual $\|Ax - b\|_2$ and the number of nonzero elements $\mathbf{card}(x)$. The
points indicated by circles on the solid line are obtained via the $\ell_1$-norm
regularized heuristic.


The signal $x$ is corrupted by an additive noise $v$:

\[ x_{\mathrm{cor}} = x + v. \]

The noise can be modeled in many different ways, but here we simply assume that
it is unknown, small, and, unlike the signal, rapidly varying. The goal is to form an
estimate $\hat{x}$ of the original signal $x$, given the corrupted signal $x_{\mathrm{cor}}$. This process is
called signal reconstruction (since we are trying to reconstruct the original signal
from the corrupted version) or de-noising (since we are trying to remove the noise
from the corrupted signal). Most reconstruction methods end up performing some
sort of smoothing operation on $x_{\mathrm{cor}}$ to produce $\hat{x}$, so the process is also called
smoothing.

One  simple  formulation  of  the  reconstruction  problem  is  the  bi-criterion  problem 

\[ \text{minimize (w.r.t. } \mathbf{R}_+^2\text{)} \quad (\|\hat{x} - x_{\mathrm{cor}}\|_2, \; \phi(\hat{x})), \qquad (6.12) \]

where $\hat{x}$ is the variable and $x_{\mathrm{cor}}$ is a problem parameter. The function $\phi : \mathbf{R}^n \to \mathbf{R}$
is convex, and is called the regularization function or smoothing objective. It is
meant to measure the roughness, or lack of smoothness, of the estimate $\hat{x}$. The
reconstruction problem (6.12) seeks signals that are close (in $\ell_2$-norm) to the corrupted
signal, and that are smooth, i.e., for which $\phi(\hat{x})$ is small. The reconstruction
problem (6.12) is a convex bi-criterion problem. We can find the Pareto optimal
points by scalarization, and solving a (scalar) convex optimization problem.

Quadratic smoothing

The simplest reconstruction method uses the quadratic smoothing function

\[ \phi_{\mathrm{quad}}(x) = \sum_{i=1}^{n-1} (x_{i+1} - x_i)^2 = \|Dx\|_2^2, \]

where $D \in \mathbf{R}^{(n-1) \times n}$ is the bidiagonal matrix

\[
D =
\begin{bmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -1 & 1 & 0 \\
0 & 0 & 0 & \cdots & 0 & -1 & 1
\end{bmatrix}.
\]

We can obtain the optimal trade-off between $\|\hat{x} - x_{\mathrm{cor}}\|_2$ and $\|D\hat{x}\|_2$ by minimizing

\[ \|\hat{x} - x_{\mathrm{cor}}\|_2^2 + \delta \|D\hat{x}\|_2^2, \]

where $\delta > 0$ parametrizes the optimal trade-off curve. The solution of this quadratic
problem,

\[ \hat{x} = (I + \delta D^T D)^{-1} x_{\mathrm{cor}}, \]

can be computed very efficiently since $I + \delta D^T D$ is tridiagonal; see appendix C.
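
A direct implementation of this formula is short. The sketch below (synthetic sinusoid plus noise, names ours) forms $D$ with `np.diff` and solves the linear system with a dense solver for simplicity; for long signals one would instead exploit the tridiagonal structure, for example with a banded solver such as `scipy.linalg.solve_banded`.

```python
import numpy as np


def quadratic_smooth(x_cor, delta):
    """Return (I + delta D^T D)^{-1} x_cor, with D the first-difference matrix."""
    n = len(x_cor)
    D = np.diff(np.eye(n), axis=0)  # rows [..., -1, 1, ...], shape (n-1, n)
    return np.linalg.solve(np.eye(n) + delta * D.T @ D, x_cor)


# Synthetic example (an assumption): slowly varying signal plus white noise.
rng = np.random.default_rng(5)
n = 200
x_true = np.sin(2 * np.pi * np.arange(n) / n)
x_cor = x_true + 0.1 * rng.standard_normal(n)

x_hat = quadratic_smooth(x_cor, delta=10.0)
print("||D x_cor||_2 =", np.linalg.norm(np.diff(x_cor)))
print("||D x_hat||_2 =", np.linalg.norm(np.diff(x_hat)))
print("error before/after:", np.linalg.norm(x_cor - x_true),
      np.linalg.norm(x_hat - x_true))
```

Sweeping `delta` over a range of values and recording the two norms traces out the optimal trade-off curve.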

Quadratic  smoothing  example 

Figure 6.8 shows a signal $x \in \mathbf{R}^{4000}$ (top) and the corrupted signal $x_{\mathrm{cor}}$ (bottom).
The optimal trade-off curve between the objectives $\|\hat{x} - x_{\mathrm{cor}}\|_2$ and $\|D\hat{x}\|_2$ is shown
in figure 6.9. The extreme point on the left of the trade-off curve corresponds to
$\hat{x} = x_{\mathrm{cor}}$, and has objective value $\|D x_{\mathrm{cor}}\|_2 = 4.4$. The extreme point on the right
corresponds to $\hat{x} = 0$, for which $\|\hat{x} - x_{\mathrm{cor}}\|_2 = \|x_{\mathrm{cor}}\|_2 = 16.2$. Note the clear knee
in the trade-off curve near $\|\hat{x} - x_{\mathrm{cor}}\|_2 \approx 3$.

Figure 6.10 shows three smoothed signals on the optimal trade-off curve, corresponding
to $\|\hat{x} - x_{\mathrm{cor}}\|_2 = 8$ (top), 3 (middle), and 1 (bottom). Comparing the
reconstructed signals with the original signal $x$, we see that the best reconstruction
is obtained for $\|\hat{x} - x_{\mathrm{cor}}\|_2 = 3$, which corresponds to the knee of the trade-off
curve. For higher values of $\|\hat{x} - x_{\mathrm{cor}}\|_2$, there is too much smoothing; for smaller
values there is too little smoothing.

Total  variation  reconstruction 

Simple quadratic smoothing works well as a reconstruction method when the original
signal is very smooth, and the noise is rapidly varying. But any rapid variations
in the original signal will, obviously, be attenuated or removed by quadratic
smoothing. In this section we describe a reconstruction method that can remove
much of the noise, while still preserving occasional rapid variations in the original
signal. The method is based on the smoothing function

\[ \phi_{\mathrm{tv}}(\hat{x}) = \sum_{i=1}^{n-1} |\hat{x}_{i+1} - \hat{x}_i| = \|D\hat{x}\|_1, \]



Figure 6.10 Three smoothed or reconstructed signals $\hat{x}$. The top one corresponds
to $\|\hat{x} - x_{\mathrm{cor}}\|_2 = 8$, the middle one to $\|\hat{x} - x_{\mathrm{cor}}\|_2 = 3$, and the
bottom one to $\|\hat{x} - x_{\mathrm{cor}}\|_2 = 1$.


which is called the total variation of $x \in \mathbf{R}^n$. Like the quadratic smoothness
measure $\phi_{\mathrm{quad}}$, the total variation function assigns large values to rapidly varying
$\hat x$. The total variation measure, however, assigns relatively less penalty to large
values of $|x_{i+1} - x_i|$.
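The difference between the two smoothness measures can be seen on two signals that both rise from 0 to 1: a gradual ramp and a single jump. The sketch below assumes $\phi_{\mathrm{quad}}(x) = \sum_i (x_{i+1} - x_i)^2$; the test signals `ramp` and `step` are made up for illustration.

```python
def phi_quad(x):
    """Quadratic smoothness measure: sum of squared successive differences."""
    return sum((x[i + 1] - x[i]) ** 2 for i in range(len(x) - 1))


def phi_tv(x):
    """Total variation: sum of absolute successive differences."""
    return sum(abs(x[i + 1] - x[i]) for i in range(len(x) - 1))


n = 101
ramp = [i / (n - 1) for i in range(n)]                  # 100 small increments of 0.01
step = [0.0 if i < n // 2 else 1.0 for i in range(n)]   # one jump of size 1

# Both signals have (approximately) the same total variation, but the
# quadratic measure charges the single jump about 100 times more than the ramp.
tv_ramp, tv_step = phi_tv(ramp), phi_tv(step)
quad_ramp, quad_step = phi_quad(ramp), phi_quad(step)
```

Because the total variation does not single out the jump for a heavier penalty, minimizing it can keep an occasional sharp transition that quadratic smoothing would flatten.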

Total  variation  reconstruction  example 

Figure 6.11 shows a signal $x \in \mathbf{R}^{2000}$ (in the top plot), and the signal corrupted
with noise $x_{\mathrm{cor}}$. The signal is mostly smooth, but has several rapid variations or
jumps in value; the noise is rapidly varying.

We first use quadratic smoothing. Figure 6.12 shows three smoothed signals on
the optimal trade-off curve between $\|D\hat x\|_2$ and $\|\hat x - x_{\mathrm{cor}}\|_2$. In the first two signals,
the rapid variations in the original signal are also smoothed. In the third signal
the steep edges in the signal are better preserved, but there is still a significant
amount of noise left.

Now we demonstrate total variation reconstruction. Figure 6.13 shows the
optimal trade-off curve between $\|D\hat x\|_1$ and $\|\hat x - x_{\mathrm{cor}}\|_2$. Figure 6.14 shows the
reconstructed signals on the optimal trade-off curve, for $\|D\hat x\|_1 = 5$ (top), $\|D\hat x\|_1 = 8$
(middle), and $\|D\hat x\|_1 = 10$ (bottom). We observe that, unlike quadratic smoothing,
total variation reconstruction preserves the sharp transitions in the signal.

Figure 6.11 A signal $x \in \mathbf{R}^{2000}$, and the corrupted signal $x_{\mathrm{cor}} \in \mathbf{R}^{2000}$. The
noise is rapidly varying, and the signal is mostly smooth, with a few rapid
variations.




Figure 6.12 Three quadratically smoothed signals $\hat x$. The top one corresponds
to $\|\hat x - x_{\mathrm{cor}}\|_2 = 10$, the middle one to $\|\hat x - x_{\mathrm{cor}}\|_2 = 7$, and the
bottom one to $\|\hat x - x_{\mathrm{cor}}\|_2 = 4$. The top one greatly reduces the noise, but
also excessively smooths out the rapid variations in the signal. The bottom
smoothed signal does not give enough noise reduction, and still smooths out
the rapid variations in the original signal. The middle smoothed signal gives
the best compromise, but still smooths out the rapid variations.


Figure 6.13 Optimal trade-off curve between $\|D\hat x\|_1$ and $\|\hat x - x_{\mathrm{cor}}\|_2$
(horizontal axis: $\|\hat x - x_{\mathrm{cor}}\|_2$, from 0 to 50).

Figure 6.14 Three reconstructed signals $\hat x$, using total variation reconstruction.
The top one corresponds to $\|D\hat x\|_1 = 5$, the middle one to $\|D\hat x\|_1 = 8$,
and the bottom one to $\|D\hat x\|_1 = 10$. The bottom one does not give quite
enough noise reduction, while the top one eliminates some of the slowly varying
parts of the signal. Note that in total variation reconstruction, unlike
quadratic smoothing, the sharp changes in the signal are preserved.

6.4  Robust approximation

6.4.1  Stochastic  robust  approximation 

We consider an approximation problem with basic objective $\|Ax - b\|$, but also wish
to take into account some uncertainty or possible variation in the data matrix $A$.
(The same ideas can be extended to handle the case where there is uncertainty in
both $A$ and $b$.) In this section we consider some statistical models for the variation
in $A$.

We assume that $A$ is a random variable taking values in $\mathbf{R}^{m \times n}$, with mean $\bar A$,
so we can describe $A$ as

$$A = \bar A + U,$$

where $U$ is a random matrix with zero mean. Here, the constant matrix $\bar A$ gives
the average value of $A$, and $U$ describes its statistical variation.

It is natural to use the expected value of $\|Ax - b\|$ as the objective:

$$\mbox{minimize} \quad \mathbf{E}\,\|Ax - b\|. \qquad (6.13)$$

We refer to this problem as the stochastic robust approximation problem. It is
always a convex optimization problem, but usually not tractable since in most
cases it is very difficult to evaluate the objective or its derivatives.

One simple case in which the stochastic robust approximation problem (6.13)
can be solved occurs when $A$ assumes only a finite number of values, i.e.,

$$\mathbf{prob}(A = A_i) = p_i, \quad i = 1, \ldots, k,$$

where $A_i \in \mathbf{R}^{m \times n}$, $\mathbf{1}^T p = 1$, $p \succeq 0$. In this case the problem (6.13) has the form

$$\mbox{minimize} \quad p_1 \|A_1 x - b\| + \cdots + p_k \|A_k x - b\|,$$

which is often called a sum-of-norms problem. It can be expressed as

$$\begin{array}{ll} \mbox{minimize} & p^T t \\ \mbox{subject to} & \|A_i x - b\| \le t_i, \quad i = 1, \ldots, k, \end{array}$$

where the variables are $x \in \mathbf{R}^n$ and $t \in \mathbf{R}^k$. If the norm is the Euclidean norm,
this sum-of-norms problem is an SOCP. If the norm is the $\ell_1$- or $\ell_\infty$-norm, the
sum-of-norms problem can be expressed as an LP; see exercise 6.8.

Some variations on the statistical robust approximation problem (6.13) are
tractable. As an example, consider the statistical robust least-squares problem

$$\mbox{minimize} \quad \mathbf{E}\,\|Ax - b\|_2^2,$$

where the norm is the Euclidean norm. We can express the objective as

$$\begin{aligned} \mathbf{E}\,\|Ax - b\|_2^2 &= \mathbf{E}\,(\bar A x - b + Ux)^T (\bar A x - b + Ux) \\ &= (\bar A x - b)^T (\bar A x - b) + \mathbf{E}\, x^T U^T U x \\ &= \|\bar A x - b\|_2^2 + x^T P x, \end{aligned}$$

where $P = \mathbf{E}\, U^T U$ (the cross terms vanish because $U$ has zero mean). Therefore
the statistical robust approximation problem has the form of a regularized
least-squares problem

$$\mbox{minimize} \quad \|\bar A x - b\|_2^2 + \|P^{1/2} x\|_2^2,$$

with solution

$$x = (\bar A^T \bar A + P)^{-1} \bar A^T b.$$

This makes perfect sense: when the matrix $A$ is subject to variation, the vector
$Ax$ will have more variation the larger $x$ is, and Jensen's inequality tells us that
variation in $Ax$ will increase the average value of $\|Ax - b\|_2$. So we need to balance
making $\bar A x - b$ small with the desire for a small $x$ (to keep the variation in $Ax$
small), which is the essential idea of regularization.

This observation gives us another interpretation of the Tikhonov regularized
least-squares problem (6.10), as a robust least-squares problem, taking into account
possible variation in the matrix $A$. The solution of the Tikhonov regularized
least-squares problem (6.10) minimizes $\mathbf{E}\,\|(A + U)x - b\|_2^2$, where $U_{ij}$ are zero mean,
uncorrelated random variables, with variance $\delta/m$ (and here, $A$ is deterministic).
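The identity between the expected squared residual and the regularized objective can be checked exactly on a small instance. This is a sketch under stated assumptions: the perturbation $U$ takes the four values $\pm U_1, \pm U_2$ with probability $1/4$ each (so it has zero mean), and the matrices `A_bar`, `U1`, `U2` and the vector `b` are made-up data with two columns so that a 2x2 solve suffices.

```python
def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]


def gram(M):
    """Return M^T M for a matrix with two columns."""
    return [[sum(r[i] * r[j] for r in M) for j in range(2)] for i in range(2)]


def solve2(G, h):
    """Solve the 2x2 linear system G z = h by Cramer's rule."""
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    return [(h[0] * G[1][1] - G[0][1] * h[1]) / det,
            (G[0][0] * h[1] - G[1][0] * h[0]) / det]


def sq_residual(M, x, b):
    return sum((r - bi) ** 2 for r, bi in zip(matvec(M, x), b))


A_bar = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # mean of A (hypothetical data)
b = [1.0, 2.0, 2.0]
U1 = [[0.5, 0.0], [0.0, 0.0], [0.0, 0.5]]
U2 = [[0.0, 0.3], [0.3, 0.0], [0.0, 0.0]]
# The four equally likely realizations of U: +U1, -U1, +U2, -U2.
scenarios = [[[s * u for u in row] for row in U]
             for U in (U1, U2) for s in (1.0, -1.0)]

# P = E U^T U = (U1^T U1 + U2^T U2) / 2 for this distribution.
G1, G2 = gram(U1), gram(U2)
P = [[0.5 * (G1[i][j] + G2[i][j]) for j in range(2)] for i in range(2)]


def expected_cost(x):
    """E ||(A_bar + U) x - b||_2^2, computed by enumerating the scenarios."""
    total = 0.0
    for U in scenarios:
        A = [[a + u for a, u in zip(ra, ru)] for ra, ru in zip(A_bar, U)]
        total += sq_residual(A, x, b)
    return total / len(scenarios)


def regularized_cost(x):
    """||A_bar x - b||_2^2 + x^T P x."""
    xPx = sum(x[i] * P[i][j] * x[j] for i in range(2) for j in range(2))
    return sq_residual(A_bar, x, b) + xPx


G = gram(A_bar)
Atb = [sum(row[j] * bi for row, bi in zip(A_bar, b)) for j in range(2)]
x_nom = solve2(G, Atb)                                   # nominal least squares
x_rob = solve2([[G[i][j] + P[i][j] for j in range(2)] for i in range(2)], Atb)
```

For any `x`, `expected_cost(x)` and `regularized_cost(x)` agree up to rounding, and `x_rob` has an expected cost no larger than that of `x_nom`. When the entries of $U$ are uncorrelated with variance $\delta/m$, the same computation gives $P = \delta I$, which is the Tikhonov case.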


6.4.2  Worst-case  robust  approximation 

It is also possible to model the variation in the matrix $A$ using a set-based,
worst-case approach. We describe the uncertainty by a set of possible values for $A$:

$$A \in \mathcal{A} \subseteq \mathbf{R}^{m \times n},$$

which we assume is nonempty and bounded. We define the associated worst-case
error of a candidate approximate solution $x \in \mathbf{R}^n$ as

$$e_{\mathrm{wc}}(x) = \sup\{\|Ax - b\| \mid A \in \mathcal{A}\},$$

which is always a convex function of $x$. The (worst-case) robust approximation
problem is to minimize the worst-case error:

$$\mbox{minimize} \quad e_{\mathrm{wc}}(x) = \sup\{\|Ax - b\| \mid A \in \mathcal{A}\}, \qquad (6.14)$$

where the variable is $x$, and the problem data are $b$ and the set $\mathcal{A}$. When $\mathcal{A}$ is the
singleton $\mathcal{A} = \{A\}$, the robust approximation problem (6.14) reduces to the basic
norm approximation problem (6.1). The robust approximation problem is always
a convex optimization problem, but its tractability depends on the norm used and
the description of the uncertainty set $\mathcal{A}$.


Example 6.5 Comparison of stochastic and worst-case robust approximation. To
illustrate the difference between the stochastic and worst-case formulations of the
robust approximation problem, we consider the least-squares problem

$$\mbox{minimize} \quad \|A(u)x - b\|_2^2,$$

where $u \in \mathbf{R}$ is an uncertain parameter and $A(u) = A_0 + uA_1$. We consider a
specific instance of the problem, with $A(u) \in \mathbf{R}^{20 \times 10}$, $\|A_0\| = 10$, $\|A_1\| = 1$, and $u$
in the interval $[-1, 1]$. (So, roughly speaking, the variation in the matrix $A$ is around
$\pm 10\%$.)

Figure 6.15 The residual $r(u) = \|A(u)x - b\|_2$ as a function of the uncertain
parameter $u$ for three approximate solutions $x$: (1) the nominal
least-squares solution $x_{\mathrm{nom}}$; (2) the solution of the stochastic robust approximation
problem $x_{\mathrm{stoch}}$ (assuming $u$ is uniformly distributed on $[-1, 1]$); and
(3) the solution of the worst-case robust approximation problem $x_{\mathrm{wc}}$, assuming
the parameter $u$ lies in the interval $[-1, 1]$. The nominal solution
achieves the smallest residual when $u = 0$, but gives much larger residuals
as $u$ approaches $-1$ or $1$. The worst-case solution has a larger residual when
$u = 0$, but its residuals do not rise much as the parameter $u$ varies over the
interval $[-1, 1]$.

We  find  three  approximate  solutions: 

•  Nominal optimal. The optimal solution $x_{\mathrm{nom}}$ is found, assuming $A(u)$ has its
nominal value $A_0$.

•  Stochastic robust approximation. We find $x_{\mathrm{stoch}}$, which minimizes
$\mathbf{E}\,\|A(u)x - b\|_2^2$, assuming the parameter $u$ is uniformly distributed on $[-1, 1]$.

•  Worst-case robust approximation. We find $x_{\mathrm{wc}}$, which minimizes

$$\sup_{-1 \le u \le 1} \|A(u)x - b\|_2 = \max\{\|(A_0 - A_1)x - b\|_2,\ \|(A_0 + A_1)x - b\|_2\}.$$

For each of these three values of $x$, we plot the residual $r(u) = \|A(u)x - b\|_2$ as a
function of the uncertain parameter $u$, in figure 6.15. These plots show how sensitive
an approximate solution can be to variation in the parameter $u$. The nominal solution
achieves the smallest residual when $u = 0$, but is quite sensitive to parameter
variation: it gives much larger residuals as $u$ deviates from $0$, and approaches $-1$ or
$1$. The worst-case solution has a larger residual when $u = 0$, but its residuals do not
rise much as $u$ varies over the interval $[-1, 1]$. The stochastic robust approximate
solution is in between.


The robust approximation problem (6.14) arises in many contexts and applications.
In an estimation setting, the set $\mathcal{A}$ gives our uncertainty in the linear relation
between the vector to be estimated and our measurement vector. Sometimes the
noise term $v$ in the model $y = Ax + v$ is called additive noise or additive error,
since it is added to the 'ideal' measurement $Ax$. In contrast, the variation in $A$ is
called multiplicative error, since it multiplies the variable $x$.

In  an  optimal  design  setting,  the  variation  can  represent  uncertainty  (arising  in 
manufacture,  say)  of  the  linear  equations  that  relate  the  design  variables  x  to  the 
results  vector  Ax.  The  robust  approximation  problem  (6.14)  is  then  interpreted  as 
the  robust  design  problem:  find  design  variables  x  that  minimize  the  worst  possible 
mismatch  between  Ax  and  b,  over  all  possible  values  of  A. 

Finite  set 

Here we have $\mathcal{A} = \{A_1, \ldots, A_k\}$, and the robust approximation problem is

$$\mbox{minimize} \quad \max_{i=1,\ldots,k} \|A_i x - b\|.$$

This problem is equivalent to the robust approximation problem with the polyhedral
set $\mathcal{A} = \mathbf{conv}\{A_1, \ldots, A_k\}$:

$$\mbox{minimize} \quad \sup\{\|Ax - b\| \mid A \in \mathbf{conv}\{A_1, \ldots, A_k\}\}.$$

We can cast the problem in epigraph form as

$$\begin{array}{ll} \mbox{minimize} & t \\ \mbox{subject to} & \|A_i x - b\| \le t, \quad i = 1, \ldots, k, \end{array}$$

which can be solved in a variety of ways, depending on the norm used. If the norm
is the Euclidean norm, this is an SOCP. If the norm is the $\ell_1$- or $\ell_\infty$-norm, we can
express it as an LP.

Norm  bound  error 

Here the uncertainty set $\mathcal{A}$ is a norm ball, $\mathcal{A} = \{\bar A + U \mid \|U\| \le a\}$, where $\|\cdot\|$ is a
norm on $\mathbf{R}^{m \times n}$. In this case we have

$$e_{\mathrm{wc}}(x) = \sup\{\|\bar A x - b + Ux\| \mid \|U\| \le a\},$$

which must be carefully interpreted since the first norm appearing is on $\mathbf{R}^m$ (and
is used to measure the size of the residual) and the second one appearing is on
$\mathbf{R}^{m \times n}$ (used to define the norm ball $\mathcal{A}$).

This expression for $e_{\mathrm{wc}}(x)$ can be simplified in several cases. As an example,
let us take the Euclidean norm on $\mathbf{R}^n$ and the associated induced norm on $\mathbf{R}^{m \times n}$,
i.e., the maximum singular value. If $\bar A x - b \ne 0$ and $x \ne 0$, the supremum in the
expression for $e_{\mathrm{wc}}(x)$ is attained for $U = a u v^T$, with

$$u = \frac{\bar A x - b}{\|\bar A x - b\|_2}, \qquad v = \frac{x}{\|x\|_2},$$

and the resulting worst-case error is

$$e_{\mathrm{wc}}(x) = \|\bar A x - b\|_2 + a\|x\|_2.$$

(It is easily verified that this expression is also valid if $x$ or $\bar A x - b$ is zero.) The
robust approximation problem (6.14) then becomes

$$\mbox{minimize} \quad \|\bar A x - b\|_2 + a\|x\|_2,$$

which is a regularized norm problem, solvable as the SOCP

$$\begin{array}{ll} \mbox{minimize} & t_1 + a t_2 \\ \mbox{subject to} & \|\bar A x - b\|_2 \le t_1, \quad \|x\|_2 \le t_2. \end{array}$$

Since the solution of this problem is the same as the solution of the regularized
least-squares problem

$$\mbox{minimize} \quad \|\bar A x - b\|_2^2 + \delta\|x\|_2^2$$

for some value of the regularization parameter $\delta$, we have another interpretation of
the regularized least-squares problem as a worst-case robust approximation problem.
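The closed-form worst-case error for the norm-bound model can be checked numerically. The sketch below uses made-up data (`A_bar`, `b`, `x`, and the radius `a` are illustrative): it builds the rank-one perturbation $U = a u v^T$ described above and compares the resulting residual with $\|\bar A x - b\|_2 + a\|x\|_2$. It also samples other rank-one perturbations with spectral norm $a$, none of which should exceed the bound.

```python
import math
import random


def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]


def norm2(v):
    return math.sqrt(sum(t * t for t in v))


def unit(v):
    s = norm2(v)
    return [t / s for t in v]


A_bar = [[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]]   # hypothetical nominal matrix
b = [1.0, 0.0, 2.0]
x = [0.5, -1.0]
a = 0.4                                        # radius of the norm ball

r = [ri - bi for ri, bi in zip(matvec(A_bar, x), b)]   # nominal residual
bound = norm2(r) + a * norm2(x)                        # claimed worst-case error

# Worst-case perturbation U = a * u * v^T with u = r/||r||, v = x/||x||.
u, v = unit(r), unit(x)
U_worst = [[a * ui * vj for vj in v] for ui in u]
worst = norm2([ri + ti for ri, ti in zip(r, matvec(U_worst, x))])

# Random rank-one perturbations a * p * q^T (spectral norm exactly a).
rng = random.Random(0)
sampled = 0.0
for _ in range(1000):
    p = unit([rng.gauss(0.0, 1.0) for _ in range(3)])
    q = unit([rng.gauss(0.0, 1.0) for _ in range(2)])
    U = [[a * pi * qj for qj in q] for pi in p]
    sampled = max(sampled, norm2([ri + ti for ri, ti in zip(r, matvec(U, x))]))
```

Here `worst` matches `bound` up to rounding, while `sampled` stays at or below it, consistent with the triangle inequality argument behind the formula.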


Uncertainty  ellipsoids 


We can also describe the variation in $A$ by giving an ellipsoid of possible values for
each row:

$$\mathcal{A} = \{[a_1 \ \cdots \ a_m]^T \mid a_i \in \mathcal{E}_i, \ i = 1, \ldots, m\},$$

where

$$\mathcal{E}_i = \{\bar a_i + P_i u \mid \|u\|_2 \le 1\}.$$

The matrix $P_i \in \mathbf{R}^{n \times n}$ describes the variation in $a_i$. We allow $P_i$ to have a nontrivial
nullspace, in order to model the situation when the variation in $a_i$ is restricted
to a subspace. As an extreme case, we take $P_i = 0$ if there is no uncertainty in $a_i$.

With this ellipsoidal uncertainty description, we can give an explicit expression
for the worst-case magnitude of each residual:

$$\begin{aligned} \sup_{a_i \in \mathcal{E}_i} |a_i^T x - b_i| &= \sup\{|\bar a_i^T x - b_i + (P_i u)^T x| \mid \|u\|_2 \le 1\} \\ &= |\bar a_i^T x - b_i| + \|P_i^T x\|_2. \end{aligned}$$
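The worst-case residual formula for a single row can be verified directly, since the maximizing $u$ is available in closed form: $u = \pm P_i^T x / \|P_i^T x\|_2$, with the sign matching that of the nominal residual. The following sketch uses made-up two-dimensional data; the helper names are illustrative.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm2(v):
    return math.sqrt(dot(v, v))


def worst_case_residual(a_bar, P, b_i, x):
    """|a_bar^T x - b_i| + ||P^T x||_2, the closed-form worst case."""
    n = len(x)
    Ptx = [sum(P[k][j] * x[k] for k in range(n)) for j in range(n)]  # P^T x
    return abs(dot(a_bar, x) - b_i) + norm2(Ptx)


def residual_at(a_bar, P, b_i, x, u):
    """|a^T x - b_i| for the specific row a = a_bar + P u."""
    a = [abar_k + dot(P[k], u) for k, abar_k in enumerate(a_bar)]
    return abs(dot(a, x) - b_i)


a_bar = [1.0, -2.0]                 # hypothetical nominal row
P = [[0.2, 0.1], [0.0, 0.3]]        # hypothetical ellipsoid shape matrix
b_i = 0.5
x = [2.0, 1.0]

formula = worst_case_residual(a_bar, P, b_i, x)

# Closed-form maximizer: align u with P^T x, signed like the nominal residual.
nominal = dot(a_bar, x) - b_i
Ptx = [sum(P[k][j] * x[k] for k in range(2)) for j in range(2)]
sign = 1.0 if nominal >= 0 else -1.0
u_star = [sign * t / norm2(Ptx) for t in Ptx]
attained = residual_at(a_bar, P, b_i, x, u_star)

# Sweep the unit circle; no point should beat the formula.
swept = max(
    residual_at(a_bar, P, b_i, x, [math.cos(th), math.sin(th)])
    for th in (2 * math.pi * k / 360 for k in range(360))
)
```

The value `attained` agrees with `formula`, and the sweep over the boundary of the unit ball never exceeds it.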


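Using this result we can solve several robust approximation problems. For
example, the robust $\ell_2$-norm approximation problem

$$\mbox{minimize} \quad e_{\mathrm{wc}}(x) = \sup\{\|Ax - b\|_2 \mid a_i \in \mathcal{E}_i, \ i = 1, \ldots, m\}$$

can be reduced to an SOCP, as follows. An explicit expression for the worst-case
error is given by

$$e_{\mathrm{wc}}(x) = \left( \sum_{i=1}^m \left( \sup_{a_i \in \mathcal{E}_i} |a_i^T x - b_i| \right)^2 \right)^{1/2} = \left( \sum_{i=1}^m \left( |\bar a_i^T x - b_i| + \|P_i^T x\|_2 \right)^2 \right)^{1/2}.$$

To minimize $e_{\mathrm{wc}}(x)$ we can solve

$$\begin{array}{ll} \mbox{minimize} & \|t\|_2 \\ \mbox{subject to} & |\bar a_i^T x - b_i| + \|P_i^T x\|_2 \le t_i, \quad i = 1, \ldots, m, \end{array}$$

where we introduced new variables $t_1, \ldots, t_m$. This problem can be formulated as

$$\begin{array}{ll} \mbox{minimize} & \|t\|_2 \\ \mbox{subject to} & \bar a_i^T x - b_i + \|P_i^T x\|_2 \le t_i, \quad i = 1, \ldots, m \\ & -\bar a_i^T x + b_i + \|P_i^T x\|_2 \le t_i, \quad i = 1, \ldots, m, \end{array}$$

which becomes an SOCP when put in epigraph form.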

Norm  bounded  error  with  linear  structure 

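As a generalization of the norm bound description $\mathcal{A} = \{\bar A + U \mid \|U\| \le a\}$, we can
define $\mathcal{A}$ as the image of a norm ball under an affine transformation:

$$\mathcal{A} = \{\bar A + u_1 A_1 + u_2 A_2 + \cdots + u_p A_p \mid \|u\| \le 1\},$$

where $\|\cdot\|$ is a norm on $\mathbf{R}^p$, and the $p + 1$ matrices $\bar A, A_1, \ldots, A_p \in \mathbf{R}^{m \times n}$ are
given. The worst-case error can be expressed as

$$\begin{aligned} e_{\mathrm{wc}}(x) &= \sup_{\|u\| \le 1} \|(\bar A + u_1 A_1 + \cdots + u_p A_p)x - b\| \\ &= \sup_{\|u\| \le 1} \|P(x)u + q(x)\|, \end{aligned}$$

where $P$ and $q$ are defined as

$$P(x) = [A_1 x \ \ A_2 x \ \ \cdots \ \ A_p x] \in \mathbf{R}^{m \times p}, \qquad q(x) = \bar A x - b \in \mathbf{R}^m.$$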

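As a first example, we consider the robust Chebyshev approximation problem

$$\mbox{minimize} \quad e_{\mathrm{wc}}(x) = \sup_{\|u\|_\infty \le 1} \|(\bar A + u_1 A_1 + \cdots + u_p A_p)x - b\|_\infty.$$

In this case we can derive an explicit expression for the worst-case error. Let $p_i(x)^T$
denote the $i$th row of $P(x)$. We have

$$\begin{aligned} e_{\mathrm{wc}}(x) &= \sup_{\|u\|_\infty \le 1} \|P(x)u + q(x)\|_\infty \\ &= \max_{i=1,\ldots,m} \ \sup_{\|u\|_\infty \le 1} |p_i(x)^T u + q_i(x)| \\ &= \max_{i=1,\ldots,m} \left( \|p_i(x)\|_1 + |q_i(x)| \right). \end{aligned}$$

The robust Chebyshev approximation problem can therefore be cast as an LP

$$\begin{array}{ll} \mbox{minimize} & t \\ \mbox{subject to} & -y_0 \preceq \bar A x - b \preceq y_0 \\ & -y_k \preceq A_k x \preceq y_k, \quad k = 1, \ldots, p \\ & y_0 + \sum_{k=1}^p y_k \preceq t\mathbf{1}, \end{array}$$

with variables $x \in \mathbf{R}^n$, $y_k \in \mathbf{R}^m$, $t \in \mathbf{R}$.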
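The explicit expression for the worst-case Chebyshev error can be compared against a brute-force search. Since $\|P(x)u + q(x)\|_\infty$ is convex in $u$, its supremum over the box $\|u\|_\infty \le 1$ is attained at a vertex, so enumerating $u \in \{-1, 1\}^p$ suffices. The data below are made up for illustration.

```python
from itertools import product


def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]


A_bar = [[2.0, 1.0], [1.0, 3.0], [0.0, 1.0]]        # hypothetical nominal matrix
A_pert = [                                           # hypothetical A_1, A_2
    [[0.1, 0.0], [0.0, 0.2], [0.1, 0.1]],
    [[0.0, 0.3], [0.1, 0.0], [0.0, 0.2]],
]
b = [1.0, 2.0, -1.0]
x = [1.0, -0.5]

q = [ri - bi for ri, bi in zip(matvec(A_bar, x), b)]   # q(x) = A_bar x - b
cols = [matvec(Ak, x) for Ak in A_pert]                # columns A_k x of P(x)
m = len(b)

# Closed form: max over rows of ||p_i(x)||_1 + |q_i(x)|.
formula = max(sum(abs(col[i]) for col in cols) + abs(q[i]) for i in range(m))


def residual_inf(u):
    """||P(x) u + q(x)||_inf for a given parameter vector u."""
    return max(abs(q[i] + sum(uk * col[i] for uk, col in zip(u, cols)))
               for i in range(m))


# Brute force over the vertices of the unit infinity-norm ball.
brute = max(residual_inf(u) for u in product((-1.0, 1.0), repeat=len(cols)))
```

The two numbers `formula` and `brute` coincide up to rounding, which is the identity the LP reformulation relies on.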

As another example, we consider the robust least-squares problem

$$\mbox{minimize} \quad e_{\mathrm{wc}}(x) = \sup_{\|u\|_2 \le 1} \|(\bar A + u_1 A_1 + \cdots + u_p A_p)x - b\|_2.$$

Here we use Lagrange duality to evaluate $e_{\mathrm{wc}}$. The worst-case error $e_{\mathrm{wc}}(x)$ is the
square root of the optimal value of the (nonconvex) quadratic optimization problem

$$\begin{array}{ll} \mbox{maximize} & \|P(x)u + q(x)\|_2^2 \\ \mbox{subject to} & u^T u \le 1, \end{array}$$

with $u$ as variable. The Lagrange dual of this problem can be expressed as the
SDP

$$\begin{array}{ll} \mbox{minimize} & t + \lambda \\ \mbox{subject to} & \left[ \begin{array}{ccc} I & P(x) & q(x) \\ P(x)^T & \lambda I & 0 \\ q(x)^T & 0 & t \end{array} \right] \succeq 0 \end{array} \qquad (6.15)$$

with variables $t, \lambda \in \mathbf{R}$. Moreover, as mentioned in §5.2 and §B.1 (and proved
in §B.4), strong duality holds for this pair of primal and dual problems. In other
words, for fixed $x$, we can compute $e_{\mathrm{wc}}(x)^2$ by solving the SDP (6.15) with variables
$t$ and $\lambda$. Optimizing jointly over $t$, $\lambda$, and $x$ is equivalent to minimizing $e_{\mathrm{wc}}(x)^2$.
We conclude that the robust least-squares problem is equivalent to the SDP (6.15)
with $x$, $\lambda$, $t$ as variables.


Example 6.6 Comparison of worst-case robust, Tikhonov regularized, and nominal
least-squares solutions. We consider an instance of the robust approximation problem

$$\mbox{minimize} \quad \sup_{\|u\|_2 \le 1} \|(\bar A + u_1 A_1 + u_2 A_2)x - b\|_2, \qquad (6.16)$$

with dimensions $m = 50$, $n = 20$. The matrix $\bar A$ has norm 10, and the two matrices
$A_1$ and $A_2$ have norm 1, so the variation in the matrix $A$ is, roughly speaking, around
10%. The uncertainty parameters $u_1$ and $u_2$ lie in the unit disk in $\mathbf{R}^2$.

We compute the optimal solution of the robust least-squares problem (6.16) $x_{\mathrm{rls}}$, as
well as the solution of the nominal least-squares problem $x_{\mathrm{ls}}$ (i.e., assuming $u = 0$),
and also the Tikhonov regularized solution $x_{\mathrm{tik}}$, with $\delta = 1$.

To illustrate the sensitivity of each of these approximate solutions to the parameter
$u$, we generate $10^5$ parameter vectors, uniformly distributed on the unit disk, and
evaluate the residual

$$\|(\bar A + u_1 A_1 + u_2 A_2)x - b\|_2$$

for each parameter value. The distributions of the residuals are shown in figure 6.16.

We can make several observations. First, the residuals of the nominal least-squares
solution are widely spread, from a smallest value around 0.52 to a largest value
around 4.9. In particular, the least-squares solution is very sensitive to parameter
variation. In contrast, both the robust least-squares and Tikhonov regularized
solutions exhibit far smaller variation in residual as the uncertainty parameter varies
over the unit disk. The robust least-squares solution, for example, achieves a residual
between 2.0 and 2.6 for all parameters in the unit disk.


6.5  Function  fitting  and  interpolation 

In function fitting problems, we select a member of a finite-dimensional subspace
of functions that best fits some given data or requirements. For simplicity we
consider real-valued functions; the ideas are readily extended to handle
vector-valued functions as well.

Figure 6.16 Distribution of the residuals for the three solutions of a least-squares
problem (6.16): $x_{\mathrm{ls}}$, the least-squares solution assuming $u = 0$; $x_{\mathrm{tik}}$,
the Tikhonov regularized solution with $\delta = 1$; and $x_{\mathrm{rls}}$, the robust
least-squares solution. The histograms were obtained by generating $10^5$ values of
the uncertain parameter vector $u$ from a uniform distribution on the unit
disk in $\mathbf{R}^2$. The bins have width 0.1.


6.5.1  Function  families 

We consider a family of functions $f_1, \ldots, f_n : \mathbf{R}^k \to \mathbf{R}$, with common domain
$\mathbf{dom}\, f_i = D$. With each $x \in \mathbf{R}^n$ we associate the function $f : \mathbf{R}^k \to \mathbf{R}$ given by

$$f(u) = x_1 f_1(u) + \cdots + x_n f_n(u) \qquad (6.17)$$

with $\mathbf{dom}\, f = D$. The family $\{f_1, \ldots, f_n\}$ is sometimes called the set of basis
functions (for the fitting problem) even when the functions are not independent.
The vector $x \in \mathbf{R}^n$, which parametrizes the subspace of functions, is our optimization
variable, and is sometimes called the coefficient vector. The basis functions
generate a subspace $\mathcal{F}$ of functions on $D$.

In many applications the basis functions are specially chosen, using prior knowledge
or experience, in order to reasonably model functions of interest with the
finite-dimensional subspace of functions. In other cases, more generic function
families are used. We describe a few of these below.


Polynomials 


One common subspace of functions on $\mathbf{R}$ consists of polynomials of degree less
than $n$. The simplest basis consists of the powers, i.e., $f_i(t) = t^{i-1}$, $i = 1, \ldots, n$.
In many applications, the same subspace is described using a different basis, for
example, a set of polynomials $f_1, \ldots, f_n$, of degree less than $n$, that are orthonormal
with respect to some positive function (or measure) $\phi : \mathbf{R} \to \mathbf{R}_+$, i.e.,

$$\int f_i(t) f_j(t) \phi(t) \, dt = \begin{cases} 1 & i = j \\ 0 & i \ne j. \end{cases}$$

Another common basis for polynomials is the Lagrange basis $f_1, \ldots, f_n$ associated
with distinct points $t_1, \ldots, t_n$, which satisfy

$$f_i(t_j) = \begin{cases} 1 & i = j \\ 0 & i \ne j. \end{cases}$$

We can also consider polynomials on $\mathbf{R}^k$, with a maximum total degree, or a
maximum degree for each variable.

As a related example, we have trigonometric polynomials of degree less than $n$,
with basis

$$\sin kt, \quad k = 1, \ldots, n - 1, \qquad \cos kt, \quad k = 0, \ldots, n - 1.$$

Piecewise-linear  functions 

We start with a triangularization of the domain $D$, which means the following. We
have a set of mesh or grid points $g_1, \ldots, g_n \in \mathbf{R}^k$, and a partition of $D$ into a set
of simplexes:

$$D = S_1 \cup \cdots \cup S_m, \qquad \mathbf{int}(S_i \cap S_j) = \emptyset \mbox{ for } i \ne j.$$

Figure 6.17 A piecewise-linear function of two variables, on the unit square.
The triangulation consists of 98 simplexes, and a uniform grid of 64 points
in the unit square.


Each  simplex  is  the  convex  hull  of  k  +  1  grid  points,  and  we  require  that  each  grid 
point  is  a  vertex  of  any  simplex  it  lies  in. 

Given a triangularization, we can construct a piecewise-linear (or more precisely,
piecewise-affine) function $f$ by assigning function values $f(g_i) = x_i$ to the grid
points, and then extending the function affinely on each simplex. The function $f$
can be expressed as (6.17) where the basis functions $f_i$ are affine on each simplex
and are defined by the conditions

$$f_i(g_j) = \begin{cases} 1 & i = j \\ 0 & i \ne j. \end{cases}$$

By  construction,  such  a  function  is  continuous. 

Figure  6.17  shows  an  example  for  k  =  2. 
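In one dimension ($k = 1$) the simplexes are the intervals between consecutive grid points, and the basis functions are the familiar 'hat' functions. The sketch below assumes a sorted grid of distinct points; the names `hat` and `pwl` and the sample grid are illustrative.

```python
def hat(i, grid, u):
    """Basis function f_i: 1 at grid[i], 0 at every other grid point,
    affine on each interval between consecutive grid points."""
    g = grid[i]
    if u == g:
        return 1.0
    if i > 0 and grid[i - 1] <= u < g:               # rising edge
        return (u - grid[i - 1]) / (g - grid[i - 1])
    if i < len(grid) - 1 and g < u <= grid[i + 1]:   # falling edge
        return (grid[i + 1] - u) / (grid[i + 1] - g)
    return 0.0


def pwl(x, grid, u):
    """Evaluate f(u) = x_1 f_1(u) + ... + x_n f_n(u) as in (6.17)."""
    return sum(xi * hat(i, grid, u) for i, xi in enumerate(x))


grid = [0.0, 1.0, 3.0, 4.0]     # sorted grid points g_1, ..., g_n
x = [2.0, 0.0, 4.0, 1.0]        # assigned values f(g_i) = x_i
midpoint_value = pwl(x, grid, 2.0)   # halfway between g_2 and g_3
```

At each grid point `pwl` returns the assigned value, and between grid points it interpolates linearly, so `midpoint_value` is the average of `x[1]` and `x[2]`.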

Piecewise  polynomials  and  splines 

The  idea  of  piecewise-affine  functions  on  a  triangulated  domain  is  readily  extended 
to  piecewise  polynomials  and  other  functions. 

Piecewise  polynomials  are  defined  as  polynomials  (of  some  maximum  degree) 
on  each  simplex  of  the  triangulation,  which  are  continuous,  i.e.,  the  polynomials 
agree  at  the  boundaries  between  simplexes.  By  further  restricting  the  piecewise 
polynomials  to  have  continuous  derivatives  up  to  a  certain  order,  we  can  define 
various  classes  of  spline  functions.  Figure  6.18  shows  an  example  of  a  cubic  spline, 
i.e.,  a  piecewise  polynomial  of  degree  3  on  R,  with  continuous  first  and  second 
derivatives. 

Figure 6.18 Cubic spline. A cubic spline is a piecewise polynomial, with
continuous first and second derivatives. In this example, the cubic spline $f$
is formed from the three cubic polynomials $p_1$ (on $[u_0, u_1]$), $p_2$ (on $[u_1, u_2]$),
and $p_3$ (on $[u_2, u_3]$). Adjacent polynomials have the same function value,
and equal first and second derivatives, at the boundary points $u_1$ and $u_2$.
In this example, the dimension of the family of functions is $n = 6$, since
we have 12 polynomial coefficients (4 per cubic polynomial), and 6 equality
constraints (3 each at $u_1$ and $u_2$).

6.5.2  Constraints

In this section we describe some constraints that can be imposed on the function
$f$, and therefore, on the variable $x \in \mathbf{R}^n$.

Function  value  interpolation  and  inequalities 

Let $v$ be a point in $D$. The value of $f$ at $v$,

$$f(v) = \sum_{i=1}^n x_i f_i(v),$$

is a linear function of $x$. Therefore interpolation conditions

$$f(v_j) = z_j, \quad j = 1, \ldots, m,$$

which require the function $f$ to have the values $z_j \in \mathbf{R}$ at specified points $v_j \in D$,
form a set of linear equalities in $x$. More generally, inequalities on the function
value at a given point, as in $l \le f(v) \le u$, are linear inequalities on the variable $x$.
There are many other interesting convex constraints on $f$ (hence, $x$) that involve
the function values at a finite set of points $v_1, \ldots, v_N$. For example, the Lipschitz
constraint

$$|f(v_j) - f(v_k)| \le L \|v_j - v_k\|, \quad j, k = 1, \ldots, m,$$

forms a set of linear inequalities in $x$.
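To make the linearity concrete, the sketch below builds the coefficient rows for function values and for the Lipschitz constraint, assuming the power basis $f_i(t) = t^{i-1}$ on $\mathbf{R}$. Each Lipschitz condition becomes two inequalities of the form $c^T x \le d$, one for each sign of the difference. The helper names and sample data are illustrative.

```python
def basis_row(v, n):
    """Row (f_1(v), ..., f_n(v)) for the power basis f_i(t) = t**(i-1)."""
    return [v ** i for i in range(n)]


def evaluate(x, v):
    """f(v) = sum_i x_i f_i(v), a linear function of x."""
    return sum(xi * fi for xi, fi in zip(x, basis_row(v, len(x))))


def lipschitz_rows(points, n, L):
    """Return (row, rhs) pairs encoding |f(v_j) - f(v_k)| <= L |v_j - v_k|
    as linear inequalities row . x <= rhs (both orderings of j, k)."""
    cons = []
    for j, vj in enumerate(points):
        for k, vk in enumerate(points):
            if j == k:
                continue
            row = [fj - fk for fj, fk in zip(basis_row(vj, n), basis_row(vk, n))]
            cons.append((row, L * abs(vj - vk)))
    return cons


def satisfied(x, cons, tol=1e-12):
    return all(sum(c * xi for c, xi in zip(row, x)) <= rhs + tol
               for row, rhs in cons)


x = [1.0, 2.0, 0.0]            # f(t) = 1 + 2t, slope 2 everywhere
points = [0.0, 0.5, 1.0]
ok_with_L2 = satisfied(x, lipschitz_rows(points, 3, 2.0))
ok_with_L1 = satisfied(x, lipschitz_rows(points, 3, 1.0))
```

For this affine example the constraints hold with $L = 2$ and fail with $L = 1$, as expected for a function of slope 2.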

We  can  also  impose  inequalities  on  the  function  values  at  an  infinite  number  of 
points.  As  an  example,  consider  the  nonnegativity  constraint 

$$f(u) \ge 0 \mbox{ for all } u \in D.$$

This  is  a  convex  constraint  on  x  (since  it  is  the  intersection  of  an  infinite  number 
of  halfspaces),  but  may  not  lead  to  a  tractable  problem  except  in  special  cases 
that  exploit  the  particular  structure  of  the  functions.  One  simple  example  occurs 
when  the  functions  are  piecewise-linear.  In  this  case,  if  the  function  values  are 
nonnegative  at  the  grid  points,  the  function  is  nonnegative  everywhere,  so  we  obtain 
a  simple  (finite)  set  of  linear  inequalities. 

As a less trivial example, consider the case when the functions are polynomials
on $\mathbf{R}$, with even maximum degree $2k$ (i.e., $n = 2k + 1$), and $D = \mathbf{R}$. As shown in
exercise 2.37, page 65, the nonnegativity constraint

$$p(u) = x_1 + x_2 u + \cdots + x_{2k+1} u^{2k} \ge 0 \mbox{ for all } u \in \mathbf{R},$$

is equivalent to

$$x_i = \sum_{m+n=i+1} Y_{mn}, \quad i = 1, \ldots, 2k + 1, \qquad Y \succeq 0,$$

where $Y \in \mathbf{S}^{k+1}$ is an auxiliary variable.
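One direction of this equivalence is easy to illustrate: any positive semidefinite $Y$ yields coefficients of a nonnegative polynomial, because $p(u) = z^T Y z$ with $z = (1, u, \ldots, u^k)$. The sketch below builds a hypothetical $Y = LL^T$ (which is positive semidefinite by construction), sums its anti-diagonals to get the coefficients, and evaluates the polynomial on a grid. Indices are 0-based in the code, so coefficient `d` collects the entries with `m + n == d`.

```python
def coeffs_from_gram(Y):
    """Coefficients of p(u) = z^T Y z, z = (1, u, ..., u^k):
    coefficient d is the sum of Y[m][n] over m + n = d (0-based indices)."""
    k1 = len(Y)                       # k + 1
    coeffs = [0.0] * (2 * k1 - 1)     # degrees 0, ..., 2k
    for m in range(k1):
        for n in range(k1):
            coeffs[m + n] += Y[m][n]
    return coeffs


def poly(coeffs, u):
    return sum(c * u ** d for d, c in enumerate(coeffs))


def quad_form(Y, u):
    z = [u ** i for i in range(len(Y))]
    return sum(z[i] * Y[i][j] * z[j] for i in range(len(Y)) for j in range(len(Y)))


# Hypothetical factor; Y = L L^T is positive semidefinite by construction.
L = [[1.0, 0.0, 0.0], [-2.0, 1.0, 0.0], [0.5, -1.0, 1.0]]
Y = [[sum(L[i][t] * L[j][t] for t in range(3)) for j in range(3)] for i in range(3)]

coeffs = coeffs_from_gram(Y)          # a degree-4 polynomial (k = 2)
grid = [-5.0 + 0.1 * i for i in range(101)]
smallest = min(poly(coeffs, u) for u in grid)
```

Here `poly(coeffs, u)` equals `quad_form(Y, u)` at every `u`, so `smallest` cannot be negative beyond rounding error. The converse direction (every nonnegative polynomial admits such a $Y$) is the content of the cited exercise and is not demonstrated by this sketch.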

Derivative constraints

Suppose the basis functions $f_i$ are differentiable at a point $v \in D$. The gradient

$$\nabla f(v) = \sum_{i=1}^n x_i \nabla f_i(v),$$

is a linear function of $x$, so interpolation conditions on the derivative of $f$ at $v$
reduce to linear equality constraints on $x$. Requiring that the norm of the gradient
at $v$ not exceed a given limit,

$$\|\nabla f(v)\| = \left\| \sum_{i=1}^n x_i \nabla f_i(v) \right\| \le M,$$

is a convex constraint on $x$. The same idea extends to higher derivatives. For
example, if $f$ is twice differentiable at $v$, the requirement that

$$lI \preceq \nabla^2 f(v) \preceq uI$$

is a linear matrix inequality in $x$, hence convex.

We can also impose constraints on the derivatives at an infinite number of
points. For example, we can require that $f$ is monotone:

$$f(u) \ge f(v) \mbox{ for all } u, v \in D, \ u \succeq v.$$

This is a convex constraint in $x$, but may not lead to a tractable problem except in
special cases. When $f$ is piecewise affine, for example, the monotonicity constraint
is equivalent to the condition $\nabla f(v) \succeq 0$ inside each of the simplexes. Since the
gradient is a linear function of the grid point values, this leads to a simple (finite)
set of linear inequalities.

As another example, we can require that the function be convex, i.e., satisfy

$$f((u + v)/2) \le (f(u) + f(v))/2 \mbox{ for all } u, v \in D$$

(which is enough to ensure convexity when $f$ is continuous). This is a convex
constraint, which has a tractable representation in some cases. One obvious example
is when $f$ is quadratic, in which case the convexity constraint reduces to the
requirement that the quadratic part of $f$ be nonnegative, which is an LMI. Another
example in which a convexity constraint leads to a tractable problem is described
in more detail in §6.5.5.

Integral  constraints 

Any linear functional $\mathcal{L}$ on the subspace of functions can be expressed as a linear
function of $x$, i.e., we have $\mathcal{L}(f) = c^T x$. Evaluation of $f$ (or a derivative) at a point
is just a special case. As another example, the linear functional

$$\mathcal{L}(f) = \int_D \phi(u) f(u) \, du,$$

where $\phi : \mathbf{R}^k \to \mathbf{R}$, can be expressed as $\mathcal{L}(f) = c^T x$, where

$$c_i = \int_D \phi(u) f_i(u) \, du.$$

Thus, a constraint of the form $\mathcal{L}(f) = a$ is a linear equality constraint on $x$. One
example of such a constraint is the moment constraint

$$\int_D t^m f(t) \, dt = a$$

(where $f : \mathbf{R} \to \mathbf{R}$).
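For a concrete case, take the power basis $f_i(t) = t^{i-1}$ on $D = [0, 1]$ (an assumption made for this sketch). Then $c_i = \int_0^1 t^m t^{i-1} \, dt = 1/(m + i)$, and the moment constraint is the single linear equation $c^T x = a$. Exact rational arithmetic keeps the check free of rounding.

```python
from fractions import Fraction


def moment_vector(m, n):
    """c_i = integral over [0, 1] of t**m * t**(i-1) dt = 1/(m + i), i = 1..n."""
    return [Fraction(1, m + i) for i in range(1, n + 1)]


def moment(x, m):
    """m-th moment of f(t) = x_1 + x_2 t + ... + x_n t**(n-1) on [0, 1],
    computed as the linear function c^T x."""
    c = moment_vector(m, len(x))
    return sum(ci * xi for ci, xi in zip(c, x))


# f(t) = 6 t (1 - t) = 6 t - 6 t**2, so x = (0, 6, -6).
x = [0, 6, -6]
zeroth = moment(x, 0)   # integral of f over [0, 1]
first = moment(x, 1)    # integral of t * f(t) over [0, 1]
```

For this choice of `x` the zeroth moment is 1 and the first moment is 1/2, so `x` satisfies, for instance, the pair of moment constraints with $a = 1$ for $m = 0$ and $a = 1/2$ for $m = 1$.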


6.5.3  Fitting  and  interpolation  problems 

Minimum  norm  function  fitting 

In a fitting problem, we are given data

$$(u_1, y_1), \ \ldots, \ (u_m, y_m)$$

with $u_i \in D$ and $y_i \in \mathbf{R}$, and seek a function $f \in \mathcal{F}$ that matches this data as
closely as possible. For example in least-squares fitting we consider the problem

$$\mbox{minimize} \quad \sum_{i=1}^m (f(u_i) - y_i)^2,$$

which is a simple least-squares problem in the variable $x$. We can add a variety of
constraints, for example linear inequalities that must be satisfied by $f$ at various
points, constraints on the derivatives of $f$, monotonicity constraints, or moment
constraints.


Example 6.7 Polynomial fitting. We are given data $u_1, \ldots, u_m \in \mathbf{R}$ and $v_1, \ldots, v_m \in \mathbf{R}$,
and hope to approximately fit a polynomial of the form

$$p(u) = x_1 + x_2 u + \cdots + x_n u^{n-1}$$

to the data. For each $x$ we form the vector of errors,

$$e = (p(u_1) - v_1, \ldots, p(u_m) - v_m).$$

To find the polynomial that minimizes the norm of the error, we solve the norm
approximation problem

$$\mbox{minimize} \quad \|e\| = \|Ax - v\|$$

with variable $x \in \mathbf{R}^n$, where $A_{ij} = u_i^{j-1}$, $i = 1, \ldots, m$, $j = 1, \ldots, n$.

Figure 6.19 shows an example with $m = 40$ data points and $n = 6$ (i.e., polynomials
of maximum degree 5), for the $\ell_2$- and $\ell_\infty$-norms.
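
The $\ell_2$ case of this example can be sketched in a few lines. The data below are synthetic (the data behind figure 6.19 are not given in the text), and the function name `fit_polynomial_l2` is only illustrative. The $\ell_\infty$ fit would require a linear programming solver and is not shown.

```python
import numpy as np

def fit_polynomial_l2(u, v, n):
    """Least-squares fit of p(u) = x[0] + x[1]*u + ... + x[n-1]*u**(n-1)."""
    A = np.vander(u, n, increasing=True)      # A[i, j] = u[i]**j
    x, *_ = np.linalg.lstsq(A, v, rcond=None)
    return x, A

# Synthetic data, an assumption made for this sketch.
rng = np.random.default_rng(0)
u = np.linspace(-1.0, 1.0, 40)
v = np.sin(3.0 * u) + 0.1 * rng.standard_normal(40)

x, A = fit_polynomial_l2(u, v, n=6)
e = A @ x - v                                  # vector of errors
```

At the least-squares solution the error vector is orthogonal to the columns of $A$, which is a convenient sanity check.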
Figure 6.19 Two polynomials of degree 5 that approximate the 40 data
points shown as circles. The polynomial shown as a solid line minimizes the
$\ell_2$-norm of the error; the polynomial shown as a dashed line minimizes the
$\ell_\infty$-norm.


Figure 6.20 Two cubic splines that approximate the 40 data points shown as
circles (which are the same as the data in figure 6.19). The spline shown as
a solid line minimizes the $\ell_2$-norm of the error; the spline shown as a dashed
line minimizes the $\ell_\infty$-norm. As in the polynomial approximation shown in
figure 6.19, the dimension of the subspace of fitting functions is 6.
Example 6.8 Spline fitting. Figure 6.20 shows the same data as in example 6.7,
and two optimal fits with cubic splines. The interval $[-1, 1]$ is divided into three
equal intervals, and we consider piecewise polynomials, with maximum degree 3, with
continuous first and second derivatives. The dimension of this subspace of functions
is 6, the same as the dimension of polynomials with maximum degree 5, considered
in example 6.7.


In the simplest forms of function fitting, we have $m \gg n$, i.e., the number
of data points is much larger than the dimension of the subspace of functions.
Smoothing  is  accomplished  automatically,  since  all  members  of  the  subspace  are 
smooth. 

Least-norm  interpolation 

In another variation of function fitting, we have fewer data points than the dimension
of the subspace of functions. In the simplest case, we require that the function
we choose must satisfy the interpolation conditions

$$f(u_i) = y_i, \quad i = 1, \ldots, m,$$

which  are  linear  equality  constraints  on  x.  Among  the  functions  that  satisfy  these 
interpolation  conditions,  we  might  seek  one  that  is  smoothest,  or  smallest.  These 
lead  to  least-norm  problems. 
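
A minimal sketch of one such least-norm problem follows. It assumes a monomial basis and takes "smallest" to mean smallest Euclidean norm of the coefficient vector $x$; both are choices made for illustration, as is the name `least_norm_interpolant`.

```python
import numpy as np

def least_norm_interpolant(u, y, n):
    """Smallest-norm coefficients x satisfying A x = y, with A[i, j] = u[i]**j."""
    A = np.vander(u, n, increasing=True)
    # For a consistent underdetermined system, the pseudo-inverse gives the
    # solution of minimum Euclidean norm.
    x = np.linalg.pinv(A) @ y
    return x, A

u = np.array([-0.5, 0.0, 0.8])     # m = 3 data points
y = np.array([1.0, 0.2, -0.4])
x, A = least_norm_interpolant(u, y, n=6)   # n = 6 > m
```

Any other interpolating coefficient vector differs from `x` by an element of the nullspace of `A`, and has norm at least as large.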

In  the  most  general  function  fitting  problem,  we  can  optimize  an  objective 
(such  as  some  measure  of  the  error  e),  subject  to  a  variety  of  convex  constraints 
that  represent  our  prior  knowledge  of  the  underlying  function. 

Interpolation,  extrapolation,  and  bounding 

By evaluating the optimal function fit $\hat{f}$ at a point $v$ not in the original data set,
we obtain a guess of what the value of the underlying function is, at the point $v$.
This is called interpolation when $v$ is between or near the given data points (e.g.,
$v \in \mathbf{conv}\{u_1, \ldots, u_m\}$), and extrapolation otherwise.

We can also produce an interval in which the value $f(v)$ can lie, by maximizing
and minimizing (the linear function) $f(v)$, subject to the constraints. We can use
the function fit to help identify faulty data or outliers. Here we might use, for
example, an $\ell_1$-norm fit, and look for data points with large errors.


6.5.4  Sparse  descriptions  and  basis  pursuit 

In basis pursuit, there is a very large number of basis functions, and the goal is to
find  a  good  fit  of  the  given  data  as  a  linear  combination  of  a  small  number  of  the 
basis  functions.  (In  this  context  the  function  family  is  linearly  dependent,  and  is 
sometimes  referred  to  as  an  over-complete  basis  or  dictionary.)  This  is  called  basis 
pursuit  since  we  are  selecting  a  much  smaller  basis,  from  the  given  over-complete 
basis,  to  model  the  data. 
Thus we seek a function $f \in \mathcal{F}$ that fits the data well,

$$f(u_i) \approx y_i, \quad i = 1, \ldots, m,$$

with a sparse coefficient vector $x$, i.e., $\mathbf{card}(x)$ small. In this case we refer to

$$f = x_1 f_1 + \cdots + x_n f_n = \sum_{i \in \mathcal{B}} x_i f_i,$$

where $\mathcal{B} = \{i \mid x_i \neq 0\}$ is the set of indices of the chosen basis elements, as a sparse
description  of  the  data.  Mathematically,  basis  pursuit  is  the  same  as  the  regressor 
selection  problem  (see  §6.4),  but  the  interpretation  (and  scale)  of  the  optimization 
problem  are  different. 

Sparse  descriptions  and  basis  pursuit  have  many  uses.  They  can  be  used  for 
de-noising  or  smoothing,  or  data  compression  for  efficient  transmission  or  storage 
of  a  signal.  In  data  compression,  the  sender  and  receiver  both  know  the  dictionary, 
or  basis  elements.  To  send  a  signal  to  the  receiver,  the  sender  first  finds  a  sparse 
representation of the signal, and then sends to the receiver only the nonzero coefficients
(to some precision). Using these coefficients, the receiver can reconstruct
(an  approximation  of)  the  original  signal. 

One common approach to basis pursuit is the same as the method for regressor
selection described in §6.4, and based on $\ell_1$-norm regularization as a heuristic for
finding sparse descriptions. We first solve the convex problem

$$\mbox{minimize} \quad \sum_{i=1}^m (f(u_i) - y_i)^2 + \gamma \|x\|_1, \qquad (6.18)$$

where $\gamma > 0$ is a parameter used to trade off the quality of the fit to the data,
and the sparsity of the coefficient vector. The solution of this problem can be used
directly, or followed by a refinement step, in which the best fit is found, using the
sparsity pattern of the solution of (6.18). In other words, we first solve (6.18), to
obtain $\hat{x}$. We then set $\mathcal{B} = \{i \mid \hat{x}_i \neq 0\}$, i.e., the set of indices corresponding to
nonzero coefficients. Then we solve the least-squares problem

$$\mbox{minimize} \quad \sum_{i=1}^m (f(u_i) - y_i)^2$$

with variables $x_i$, $i \in \mathcal{B}$, and $x_i = 0$ for $i \notin \mathcal{B}$.
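
The two-step heuristic (solve (6.18), then refit on the sparsity pattern) can be sketched as follows. The text does not prescribe a solver for (6.18); this sketch uses a basic proximal gradient iteration (ISTA), chosen only because it needs nothing beyond NumPy. The matrix `A` with entries `A[i, j]` $= f_j(u_i)$, the random test data, and all function names are assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def l1_regularized_ls(A, y, gamma, iters=2000):
    """Approximately minimize ||A x - y||_2^2 + gamma * ||x||_1 by ISTA."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = 2.0 * A.T @ (A @ x - y)
        x = soft_threshold(x - grad / L, gamma / L)
    return x

def refine(A, y, x):
    """Least-squares refit restricted to the sparsity pattern of x."""
    B = np.flatnonzero(x)                    # indices of nonzero coefficients
    x_ref = np.zeros_like(x)
    if B.size > 0:
        sol, *_ = np.linalg.lstsq(A[:, B], y, rcond=None)
        x_ref[B] = sol
    return x_ref, B

# Hypothetical over-complete dictionary: 40 samples, 100 basis functions.
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[[5, 37, 80]] = [2.0, -1.5, 1.0]
y = A @ x_true

x_l1 = l1_regularized_ls(A, y, gamma=1.0)
x_ref, B = refine(A, y, x_l1)
```

Because the refit minimizes the squared error over all vectors with the same sparsity pattern, its residual is never larger than that of the regularized solution. For dictionaries of the size discussed in the text, a structure-exploiting method would be needed in place of this dense iteration.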

In  basis  pursuit  and  sparse  description  applications  it  is  not  uncommon  to  have 
a very large dictionary, with $n$ on the order of $10^4$ or much more. To be effective,
algorithms  for  solving  (6.18)  must  exploit  problem  structure,  which  derives  from 
the  structure  of  the  dictionary  signals. 

Time-frequency  analysis  via  basis  pursuit 

In  this  section  we  illustrate  basis  pursuit  and  sparse  representation  with  a  simple 
example.  We  consider  functions  (or  signals)  on  R,  with  the  range  of  interest  [0, 1]. 
We  think  of  the  independent  variable  as  time,  so  we  use  t  (instead  of  u)  to  denote 
it. 

We  first  describe  the  basis  functions  in  the  dictionary.  Each  basis  function  is  a 
Gaussian  sinusoidal  pulse,  or  Gabor  function,  with  form 

$$e^{-(t - \tau)^2 / \sigma^2} \cos(\omega t + \phi),$$
Figure 6.21 Three of the basis elements in the dictionary, all with center time
$\tau = 0.5$ and cosine phase. The top signal has frequency $\omega = 0$, the middle
one has frequency $\omega = 75$, and the bottom one has frequency $\omega = 150$.


where $\sigma > 0$ gives the width of the pulse, $\tau$ is the time of (the center of) the pulse,
$\omega \geq 0$ is the frequency, and $\phi$ is the phase angle. All of the basis functions have
width $\sigma = 0.05$. The pulse times and frequencies are

$$\tau = 0.002k, \quad k = 0, \ldots, 500, \qquad \omega = 5k, \quad k = 0, \ldots, 30.$$

For each time $\tau$, there is one basis element with frequency zero (and phase $\phi = 0$),
and 2 basis elements (cosine and sine, i.e., phase $\phi = 0$ and $\phi = \pi/2$) for each of 30
remaining frequencies, so all together there are $501 \times 61 = 30561$ basis elements.
The basis elements are naturally indexed by time, frequency, and phase (cosine or
sine), so we denote them as

$$f_{\tau,\omega,c}, \quad \tau = 0, 0.002, \ldots, 1, \quad \omega = 0, 5, \ldots, 150,$$

$$f_{\tau,\omega,s}, \quad \tau = 0, 0.002, \ldots, 1, \quad \omega = 5, \ldots, 150.$$

Three of these basis functions (all with time $\tau = 0.5$) are shown in figure 6.21.
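
The indexing of the dictionary can be made concrete with a short sketch. It enumerates the (time, frequency, phase) triples described above and evaluates a single Gabor function on demand, rather than forming the full dictionary matrix. The names `dictionary_index` and `gabor` are illustrative, not from the text.

```python
import numpy as np

SIGMA = 0.05   # common pulse width of all basis functions

def dictionary_index():
    """List of (tau, omega, phi) triples, one per basis element."""
    taus = 0.002 * np.arange(501)
    omegas = 5.0 * np.arange(31)
    index = []
    for tau in taus:
        for omega in omegas:
            index.append((tau, omega, 0.0))              # cosine phase
            if omega > 0:
                index.append((tau, omega, np.pi / 2))    # sine phase
    return index

def gabor(t, tau, omega, phi, sigma=SIGMA):
    """Evaluate exp(-(t - tau)^2 / sigma^2) * cos(omega * t + phi)."""
    return np.exp(-((t - tau) ** 2) / sigma ** 2) * np.cos(omega * t + phi)

index = dictionary_index()
t = np.linspace(0.0, 1.0, 501)
f_mid = gabor(t, 0.5, 75.0, 0.0)    # the middle signal of figure 6.21
```

The length of `index` reproduces the count $501 \times 61 = 30561$ given in the text.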

Basis  pursuit  with  this  dictionary  can  be  thought  of  as  a  time-frequency  analysis 
of the data. If a basis element $f_{\tau,\omega,c}$ or $f_{\tau,\omega,s}$ appears in the sparse representation
of a signal (i.e., with a nonzero coefficient), we can interpret this as meaning that
the data contains the frequency $\omega$ at time $\tau$.

We  will  use  basis  pursuit  to  find  a  sparse  approximation  of  the  signal 


$$y(t) = a(t) \sin \theta(t)$$
Figure 6.22 Top. The original signal (solid line) and approximation $\hat{y}$ obtained
by basis pursuit (dashed line) are almost indistinguishable. Bottom.
The approximation error $y(t) - \hat{y}(t)$, with different vertical scale.


where 

$$a(t) = 1 + 0.5 \sin(11t), \qquad \theta(t) = 30 \sin(5t).$$

(This signal is chosen only because it is simple to describe, and exhibits noticeable
changes in its spectral content over time.) We can interpret $a(t)$ as the signal
amplitude, and $\theta(t)$ as its total phase. We can also interpret

$$\omega(t) = \left| \frac{d\theta}{dt} \right| = 150 |\cos(5t)|$$


as the instantaneous frequency of the signal at time $t$. The data are given as 501
uniformly spaced samples over the interval $[0, 1]$, i.e., we are given 501 pairs $(t_k, y_k)$
with

$$t_k = 0.002k, \quad y_k = y(t_k), \quad k = 0, \ldots, 500.$$
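
The test signal is easy to reproduce. The sketch below generates 501 uniformly spaced samples on $[0, 1]$ and compares the stated instantaneous frequency with a finite-difference estimate of $|d\theta/dt|$; the variable names are arbitrary.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 501)            # 501 uniformly spaced samples on [0, 1]
a = 1.0 + 0.5 * np.sin(11.0 * t)          # amplitude a(t)
theta = 30.0 * np.sin(5.0 * t)            # total phase theta(t)
y = a * np.sin(theta)                     # sampled signal y_k = y(t_k)

omega_inst = 150.0 * np.abs(np.cos(5.0 * t))   # instantaneous frequency from the formula
omega_fd = np.abs(np.gradient(theta, t))       # finite-difference estimate of |dtheta/dt|
```

Away from the two endpoints, where `np.gradient` falls back to one-sided differences, the two frequency estimates agree closely.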

We first solve the $\ell_1$-norm regularized least-squares problem (6.18), with $\gamma = 1$.
The resulting optimal coefficient vector is very sparse, with only 42 nonzero
coefficients out of 30561. We then find the least-squares fit of the original signal
using these 42 basis vectors. The result $\hat{y}$ is compared with the original signal
$y$ in figure 6.22. The top figure shows the approximated signal (in dashed line)
and, almost indistinguishable, the original signal $y(t)$ (in solid line). The bottom
figure shows the error $y(t) - \hat{y}(t)$. As is clear from the figure, we have obtained an
Figure 6.23 Top: Original signal. Bottom: Time-frequency plot. The dashed
curve shows the instantaneous frequency $\omega(t) = 150 |\cos(5t)|$ of the original
signal. Each circle corresponds to a chosen basis element in the approximation
obtained by basis pursuit. The horizontal axis shows the time index $\tau$,
and the vertical axis shows the frequency index $\omega$ of the basis element.


approximation $\hat{y}$ with a very good relative fit. The relative error is

$$\frac{(1/501) \sum_{i=1}^{501} (y(t_i) - \hat{y}(t_i))^2}{(1/501) \sum_{i=1}^{501} y(t_i)^2} = 2.6 \cdot 10^{-4}.$$

By  plotting  the  pattern  of  nonzero  coefficients  versus  time  and  frequency,  we 
obtain a time-frequency analysis of the original data. Such a plot is shown in figure 6.23,
along with the instantaneous frequency. The plot shows that the nonzero
components  closely  track  the  instantaneous  frequency. 


6.5.5  Interpolation  with  convex  functions 

In  some  special  cases  we  can  solve  interpolation  problems  involving  an  infinite- 
dimensional  set  of  functions,  using  finite-dimensional  convex  optimization.  In  this 
section  we  describe  an  example. 

We start with the following question: When does there exist a convex function
$f : \mathbf{R}^k \to \mathbf{R}$, with $\mathbf{dom}\, f = \mathbf{R}^k$, that satisfies the interpolation conditions

$$f(u_i) = y_i, \quad i = 1, \ldots, m,$$

at given points $u_i \in \mathbf{R}^k$? (Here we do not restrict $f$ to lie in any finite-dimensional
subspace of functions.) The answer is: if and only if there exist $g_1, \ldots, g_m$ such
that

$$y_j \geq y_i + g_i^T (u_j - u_i), \quad i, j = 1, \ldots, m. \qquad (6.19)$$

To see this, first suppose that $f$ is convex, $\mathbf{dom}\, f = \mathbf{R}^k$, and $f(u_i) = y_i$,
$i = 1, \ldots, m$. At each $u_i$ we can find a vector $g_i$ such that

$$f(z) \geq f(u_i) + g_i^T (z - u_i) \qquad (6.20)$$

for all $z$. If $f$ is differentiable, we can take $g_i = \nabla f(u_i)$; in the more general case,
we can construct $g_i$ by finding a supporting hyperplane to $\mathbf{epi}\, f$ at $(u_i, y_i)$. (The
vectors $g_i$ are called subgradients.) By applying (6.20) to $z = u_j$, we obtain (6.19).
Conversely, suppose $g_1, \ldots, g_m$ satisfy (6.19). Define $f$ as

$$f(z) = \max_{i=1,\ldots,m} \left( y_i + g_i^T (z - u_i) \right)$$

for all $z \in \mathbf{R}^k$. Clearly, $f$ is a (piecewise-linear) convex function. The inequalities
(6.19) imply that $f(u_i) = y_i$, for $i = 1, \ldots, m$.
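
The converse direction of this argument is constructive, and can be checked numerically. The sketch below tests condition (6.19) for given data and subgradients, and builds the piecewise-linear interpolant from the proof. The sample data come from the convex function $\|u\|_2^2$ with its gradients as subgradients, an arbitrary choice for illustration; the function names are hypothetical.

```python
import numpy as np

def satisfies_6_19(U, y, G, tol=1e-9):
    """Check y[j] >= y[i] + G[i] @ (U[j] - U[i]) for all i, j."""
    # lower[i, j] = y[i] + G[i] @ (U[j] - U[i])
    lower = y[:, None] + G @ U.T - np.sum(G * U, axis=1)[:, None]
    return bool(np.all(y[None, :] >= lower - tol))

def piecewise_linear_interpolant(U, y, G):
    """Return f(z) = max_i (y[i] + G[i] @ (z - U[i]))."""
    offsets = y - np.sum(G * U, axis=1)
    def f(z):
        return np.max(offsets + G @ z)
    return f

rng = np.random.default_rng(0)
U = rng.standard_normal((8, 2))     # rows are the points u_i in R^2
y = np.sum(U ** 2, axis=1)          # samples of the convex function ||u||^2
G = 2.0 * U                         # its gradients, used as subgradients
f = piecewise_linear_interpolant(U, y, G)
```

For these data (6.19) holds, since $y_j - y_i - g_i^T(u_j - u_i) = \|u_j - u_i\|_2^2 \geq 0$, and the constructed `f` reproduces each $y_i$ at $u_i$.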

We can use this result to solve several problems involving interpolation, approximation,
or bounding, with convex functions.

Fitting  a  convex  function  to  given  data 

Perhaps  the  simplest  application  is  to  compute  the  least-squares  fit  of  a  convex 
function to given data $(u_i, y_i)$, $i = 1, \ldots, m$:

$$\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^m (y_i - f(u_i))^2 \\
\mbox{subject to} & f : \mathbf{R}^k \to \mathbf{R} \mbox{ is convex}, \quad \mathbf{dom}\, f = \mathbf{R}^k.
\end{array}$$

This is an infinite-dimensional problem, since the variable is $f$, which is in the
space of continuous real-valued functions on $\mathbf{R}^k$. Using the result above, we can
formulate this problem as

$$\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^m (y_i - \hat{y}_i)^2 \\
\mbox{subject to} & \hat{y}_j \geq \hat{y}_i + g_i^T (u_j - u_i), \quad i, j = 1, \ldots, m,
\end{array}$$

which is a QP with variables $\hat{y} \in \mathbf{R}^m$ and $g_1, \ldots, g_m \in \mathbf{R}^k$. The optimal value of
this problem is zero if and only if the given data can be interpolated by a convex
function, i.e., if there is a convex function that satisfies $f(u_i) = y_i$. An example is
shown  in  figure  6.24. 

Bounding  values  of  an  interpolating  convex  function 

As another simple example, suppose that we are given data $(u_i, y_i)$, $i = 1, \ldots, m$,
which can be interpolated by a convex function. We would like to determine the
range of possible values of $f(u_0)$, where $u_0$ is another point in $\mathbf{R}^k$, and $f$ is any
convex function that interpolates the given data. To find the smallest possible
value of $f(u_0)$ we solve the LP

$$\begin{array}{ll}
\mbox{minimize} & y_0 \\
\mbox{subject to} & y_j \geq y_i + g_i^T (u_j - u_i), \quad i, j = 0, \ldots, m,
\end{array}$$
Figure 6.24 Least-squares fit of a convex function to data, shown as circles.
The (piecewise-linear) function shown minimizes the sum of squared fitting
error, over all convex functions.


which is an LP with variables $y_0 \in \mathbf{R}$, $g_0, \ldots, g_m \in \mathbf{R}^k$. By maximizing $y_0$ (which
is also an LP) we find the largest possible value of $f(u_0)$ for a convex function that
interpolates  the  given  data. 

Interpolation  with  monotone  convex  functions 

As  an  extension  of  convex  interpolation,  we  can  consider  interpolation  with  a  convex 
and  monotone  nondecreasing  function.  It  can  be  shown  that  there  exists  a  convex 
function $f : \mathbf{R}^k \to \mathbf{R}$, with $\mathbf{dom}\, f = \mathbf{R}^k$, that satisfies the interpolation conditions

$$f(u_i) = y_i, \quad i = 1, \ldots, m,$$

and is monotone nondecreasing (i.e., $f(u) \geq f(v)$ whenever $u \succeq v$), if and only if
there exist $g_1, \ldots, g_m \in \mathbf{R}^k$, such that

$$g_i \succeq 0, \quad i = 1, \ldots, m, \qquad y_j \geq y_i + g_i^T (u_j - u_i), \quad i, j = 1, \ldots, m. \qquad (6.21)$$

In other words, we add to the convex interpolation conditions (6.19), the condition
that the subgradients $g_i$ are all nonnegative. (See exercise 6.12.)

Bounding  consumer  preference 

As  an  application,  we  consider  a  problem  of  predicting  consumer  preferences.  We 
consider different baskets of goods, consisting of different amounts of $n$ consumer
goods. A goods basket is specified by a vector $x \in [0, 1]^n$ where $x_i$ denotes the
amount of consumer good $i$. We assume the amounts are normalized so that
$0 \leq x_i \leq 1$, i.e., $x_i = 0$ is the minimum and $x_i = 1$ is the maximum possible
amount of good $i$. Given two baskets of goods $x$ and $\tilde{x}$, a consumer can either
prefer $x$ to $\tilde{x}$, or prefer $\tilde{x}$ to $x$, or consider $x$ and $\tilde{x}$ equally attractive. We consider
one  model  consumer,  whose  choices  are  repeatable. 
We model consumer preference in the following way. We assume there is an
underlying utility function $u : \mathbf{R}^n \to \mathbf{R}$, with domain $[0, 1]^n$; $u(x)$ gives a measure
of  the  utility  derived  by  the  consumer  from  the  goods  basket  x.  Given  a  choice 
between  two  baskets  of  goods,  the  consumer  chooses  the  one  that  has  larger  utility, 
and  will  be  ambivalent  when  the  two  baskets  have  equal  utility.  It  is  reasonable  to 
assume  that  u  is  monotone  nondecreasing.  This  means  that  the  consumer  always 
prefers  to  have  more  of  any  good,  with  the  amounts  of  all  other  goods  the  same.  It 
is  also  reasonable  to  assume  that  u  is  concave.  This  models  satiation,  or  decreasing 
marginal  utility  as  we  increase  the  amount  of  goods. 

Now  suppose  we  are  given  some  consumer  preference  data,  but  we  do  not  know 
the  underlying  utility  function  u.  Specifically,  we  have  a  set  of  goods  baskets 
$a_1, \ldots, a_m \in [0, 1]^n$, and some information about preferences among them:

$$u(a_i) > u(a_j) \mbox{ for } (i, j) \in \mathcal{P}, \qquad u(a_i) \geq u(a_j) \mbox{ for } (i, j) \in \mathcal{P}_{\rm weak}, \qquad (6.22)$$

where $\mathcal{P}, \mathcal{P}_{\rm weak} \subseteq \{1, \ldots, m\} \times \{1, \ldots, m\}$ are given. Here $\mathcal{P}$ gives the set of known
preferences: $(i, j) \in \mathcal{P}$ means that basket $a_i$ is known to be preferred to basket $a_j$.
The set $\mathcal{P}_{\rm weak}$ gives the set of known weak preferences: $(i, j) \in \mathcal{P}_{\rm weak}$ means that
basket $a_i$ is preferred to basket $a_j$, or that the two baskets are equally attractive.

We  first  consider  the  following  question:  How  can  we  determine  if  the  given  data 
are  consistent,  i.e.,  whether  or  not  there  exists  a  concave  nondecreasing  utility 
function  u  for  which  (6.22)  holds?  This  is  equivalent  to  solving  the  feasibility 
problem 

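$$\begin{array}{ll}
\mbox{find} & u \\
\mbox{subject to} & u : \mathbf{R}^n \to \mathbf{R} \mbox{ concave and nondecreasing} \\
& u(a_i) > u(a_j), \quad (i, j) \in \mathcal{P} \\
& u(a_i) \geq u(a_j), \quad (i, j) \in \mathcal{P}_{\rm weak},
\end{array} \qquad (6.23)$$

with the function $u$ as the (infinite-dimensional) optimization variable. Since the
constraints in (6.23) are all homogeneous, we can express the problem in the equivalent
form

$$\begin{array}{ll}
\mbox{find} & u \\
\mbox{subject to} & u : \mathbf{R}^n \to \mathbf{R} \mbox{ concave and nondecreasing} \\
& u(a_i) \geq u(a_j) + 1, \quad (i, j) \in \mathcal{P} \\
& u(a_i) \geq u(a_j), \quad (i, j) \in \mathcal{P}_{\rm weak},
\end{array} \qquad (6.24)$$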


which  uses  only  nonstrict  inequalities.  (It  is  clear  that  if  u  satisfies  (6.24),  then 
it  must  satisfy  (6.23);  conversely,  if  u  satisfies  (6.23),  then  it  can  be  scaled  to 
satisfy  (6.24).)  This  problem,  in  turn,  can  be  cast  as  a  (finite-dimensional)  linear 
programming  feasibility  problem,  using  the  interpolation  result  on  page  339: 


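$$\begin{array}{ll}
\mbox{find} & u_1, \ldots, u_m, \; g_1, \ldots, g_m \\
\mbox{subject to} & g_i \succeq 0, \quad i = 1, \ldots, m \\
& u_j \leq u_i + g_i^T (a_j - a_i), \quad i, j = 1, \ldots, m \\
& u_i \geq u_j + 1, \quad (i, j) \in \mathcal{P} \\
& u_i \geq u_j, \quad (i, j) \in \mathcal{P}_{\rm weak}.
\end{array} \qquad (6.25)$$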

By  solving  this  linear  programming  feasibility  problem,  we  can  determine  whether 
there  exists  a  concave,  nondecreasing  utility  function  that  is  consistent  with  the 
given sets of strict and nonstrict preferences. If (6.25) is feasible, there is at least
one such utility function (and indeed, we can construct one that is piecewise-linear,
from a feasible $u_1, \ldots, u_m$, $g_1, \ldots, g_m$). If (6.25) is not feasible, we can conclude
that there is no concave nondecreasing utility function that is consistent with the given
sets of strict and nonstrict preferences.
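
A minimal sketch of the feasibility test (6.25) is given below, assuming SciPy's `linprog` is available as the LP solver (the text does not name one). The variable ordering, the 0-based index pairs, and the function name `preferences_consistent` are choices made for this example.

```python
import numpy as np
from scipy.optimize import linprog

def preferences_consistent(a, strict, weak):
    """Feasibility of the LP (6.25).

    a: (m, n) array of baskets; strict, weak: lists of 0-based pairs (i, j)
    meaning basket i is strictly / weakly preferred to basket j.
    """
    m, n = a.shape
    num_vars = m + m * n                     # u_1..u_m, then g_1..g_m stacked
    rows, rhs = [], []

    def u_row(i, j):                         # coefficients of u_j - u_i
        row = np.zeros(num_vars)
        row[j] += 1.0
        row[i] -= 1.0
        return row

    for i in range(m):
        for j in range(m):
            if i != j:
                # u_j - u_i - g_i^T (a_j - a_i) <= 0
                row = u_row(i, j)
                row[m + i * n : m + (i + 1) * n] = -(a[j] - a[i])
                rows.append(row)
                rhs.append(0.0)
    for i, j in strict:                      # u_j - u_i <= -1
        rows.append(u_row(i, j))
        rhs.append(-1.0)
    for i, j in weak:                        # u_j - u_i <= 0
        rows.append(u_row(i, j))
        rhs.append(0.0)

    bounds = [(None, None)] * m + [(0.0, None)] * (m * n)   # u free, g_i >= 0
    res = linprog(np.zeros(num_vars), A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    return res.status == 0                   # 0: solved, i.e., a feasible point exists

a = np.array([[0.1, 0.1], [0.5, 0.5], [0.9, 0.9]])
ok = preferences_consistent(a, strict=[(1, 0), (2, 1)], weak=[])
bad = preferences_consistent(a, strict=[(0, 2)], weak=[])
```

In this toy instance the first preference set is consistent (a linear utility works), while the second asks for the smallest basket to be strictly preferred to the largest, which contradicts monotonicity.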

As an example, suppose that $\mathcal{P}$ and $\mathcal{P}_{\rm weak}$ are consumer preferences that are
known to be consistent with at least one concave nondecreasing utility function. Consider
a pair $(k, l)$ that is not in $\mathcal{P}$ or $\mathcal{P}_{\rm weak}$, i.e., consumer preference between
baskets $k$ and $l$ is not known. In some cases we can conclude that a preference
holds between basket $k$ and $l$, even without knowing the underlying preference
function. To do this we augment the known preferences (6.22) with the inequality
$u(a_k) \leq u(a_l)$, which means that basket $l$ is preferred to basket $k$, or they are
equally attractive. We then solve the feasibility linear program (6.25), including
the extra weak preference $u(a_k) \leq u(a_l)$. If the augmented set of preferences is infeasible,
it means that any concave nondecreasing utility function that is consistent
with the original given consumer preference data must also satisfy $u(a_k) > u(a_l)$.
In other words, we can conclude that basket $k$ is preferred to basket $l$, without
knowing the underlying utility function.


Example  6.9  Here  we  give  a  simple  numerical  example  that  illustrates  the  discussion 
above.  We  consider  baskets  of  two  goods  (so  we  can  easily  plot  the  goods  baskets). 
To generate the consumer preference data $\mathcal{P}$, we compute 40 random points in $[0, 1]^2$,
and then compare them using the utility function

$$u(x_1, x_2) = (1.1 x_1^{1/2} + 0.8 x_2^{1/2}) / 1.9.$$

These goods baskets, and a few level curves of the utility function $u$, are shown in
figure 6.25.

We now use the consumer preference data (but not, of course, the true utility function
$u$) to compare each of these 40 goods baskets to the basket $a_0 = (0.5, 0.5)$. For each
original basket $a_i$, we solve the linear programming feasibility problem described
above, to see if we can conclude that basket $a_0$ is preferred to basket $a_i$. Similarly,
we check whether we can conclude that basket $a_i$ is preferred to basket $a_0$. For each
basket $a_i$, there are three possible outcomes: we can conclude that $a_0$ is definitely
preferred to $a_i$, that $a_i$ is definitely preferred to $a_0$, or (if both LP feasibility problems
are  feasible)  that  no  conclusion  is  possible.  (Here,  definitely  preferred  means  that  the 
preference  holds  for  any  concave  nondecreasing  utility  function  that  is  consistent  with 
the  original  given  data.) 

We  find  that  21  of  the  baskets  are  definitely  rejected  in  favor  of  (0.5,  0.5),  and  14 
of  the  baskets  are  definitely  preferred.  We  cannot  make  any  conclusion,  from  the 
consumer  preference  data,  about  the  remaining  5  baskets.  These  results  are  shown  in 
figure  6.26.  Note  that  goods  baskets  below  and  to  the  left  of  (0.5,  0.5)  will  definitely 
be  rejected  in  favor  of  (0.5, 0.5),  using  only  the  monotonicity  property  of  the  utility 
function,  and  similarly,  those  points  that  are  above  and  to  the  right  of  (0.5,  0.5)  must 
be  preferred.  So  for  these  17  points,  there  is  no  need  to  solve  the  feasibility  LP  (6.25). 
Classifying  the  23  points  in  the  other  two  quadrants,  however,  requires  the  concavity 
assumption,  and  solving  the  feasibility  LP  (6.25). 
Figure 6.25 Forty goods baskets $a_1, \ldots, a_{40}$, shown as circles. The
$0.1, 0.2, \ldots, 0.9$ level curves of the true utility function $u$ are shown as dashed
lines. This utility function is used to find the consumer preference data $\mathcal{P}$
among the 40 baskets.




Figure 6.26 Results of consumer preference analysis using the LP (6.25), for a
new goods basket $a_0 = (0.5, 0.5)$. The original baskets are displayed as open
circles if they are definitely rejected ($u(a_k) < u(a_0)$), as solid black circles
if they are definitely preferred ($u(a_k) > u(a_0)$), and as squares when no
conclusion can be made. The level curve of the underlying utility function,
that passes through $(0.5, 0.5)$, is shown as a dashed curve. The vertical and
horizontal lines passing through $(0.5, 0.5)$ divide $[0, 1]^2$ into four quadrants.
Points in the upper right quadrant must be preferred to $(0.5, 0.5)$, by the
monotonicity assumption on $u$. Similarly, $(0.5, 0.5)$ must be preferred to the
points in the lower left quadrant. For the points in the other two quadrants,
the results are not obvious.
Exercises


Norm  approximation  and  least-norm  problems 

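6.1 Quadratic bounds for log barrier penalty. Let $\phi : \mathbf{R} \to \mathbf{R}$ be the log barrier penalty
function with limit $a > 0$:

$$\phi(u) = \left\{ \begin{array}{ll} -a^2 \log(1 - (u/a)^2) & |u| < a \\ \infty & \mbox{otherwise.} \end{array} \right.$$

Show that if $u \in \mathbf{R}^m$ satisfies $\|u\|_\infty < a$, then

$$\|u\|_2^2 \leq \sum_{i=1}^m \phi(u_i) \leq \frac{\phi(\|u\|_\infty)}{\|u\|_\infty^2} \|u\|_2^2.$$

This means that $\sum_{i=1}^m \phi(u_i)$ is well approximated by $\|u\|_2^2$ if $\|u\|_\infty$ is small compared to
$a$. For example, if $\|u\|_\infty / a = 0.25$, then

$$\|u\|_2^2 \leq \sum_{i=1}^m \phi(u_i) \leq 1.033 \cdot \|u\|_2^2.$$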

i=  1 


6.2 $\ell_1$-, $\ell_2$-, and $\ell_\infty$-norm approximation by a constant vector. What is the solution of the
norm approximation problem with one scalar variable $x \in \mathbf{R}$,

$$\mbox{minimize} \quad \|x \mathbf{1} - b\|,$$

for the $\ell_1$-, $\ell_2$-, and $\ell_\infty$-norms?

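6.3 Formulate the following approximation problems as LPs, QPs, SOCPs, or SDPs. The
problem data are $A \in \mathbf{R}^{m \times n}$ and $b \in \mathbf{R}^m$. The rows of $A$ are denoted $a_i^T$.

(a) Deadzone-linear penalty approximation: minimize $\sum_{i=1}^m \phi(a_i^T x - b_i)$, where

$$\phi(u) = \left\{ \begin{array}{ll} 0 & |u| \leq a \\ |u| - a & |u| > a, \end{array} \right.$$

where $a > 0$.

(b) Log-barrier penalty approximation: minimize $\sum_{i=1}^m \phi(a_i^T x - b_i)$, where

$$\phi(u) = \left\{ \begin{array}{ll} -a^2 \log(1 - (u/a)^2) & |u| < a \\ \infty & |u| \geq a, \end{array} \right.$$

with $a > 0$.

(c) Huber penalty approximation: minimize $\sum_{i=1}^m \phi(a_i^T x - b_i)$, where

$$\phi(u) = \left\{ \begin{array}{ll} u^2 & |u| \leq M \\ M(2|u| - M) & |u| > M, \end{array} \right.$$

with $M > 0$.

(d) Log-Chebyshev approximation: minimize $\max_{i=1,\ldots,m} |\log(a_i^T x) - \log b_i|$. We assume
$b \succ 0$. An equivalent convex form is

$$\begin{array}{ll}
\mbox{minimize} & t \\
\mbox{subject to} & 1/t \leq a_i^T x / b_i \leq t, \quad i = 1, \ldots, m,
\end{array}$$

with variables $x \in \mathbf{R}^n$ and $t \in \mathbf{R}$, and domain $\mathbf{R}^n \times \mathbf{R}_{++}$.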
(e) Minimizing the sum of the largest $k$ residuals:

$$\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^k |r|_{[i]} \\
\mbox{subject to} & r = Ax - b,
\end{array}$$

where $|r|_{[1]} \geq |r|_{[2]} \geq \cdots \geq |r|_{[m]}$ are the numbers $|r_1|, |r_2|, \ldots, |r_m|$ sorted in
decreasing order. (For $k = 1$, this reduces to $\ell_\infty$-norm approximation; for $k = m$, it
reduces to $\ell_1$-norm approximation.) Hint. See exercise 5.19.

6.4 A differentiable approximation of $\ell_1$-norm approximation. The function $\phi(u) = (u^2 + \epsilon)^{1/2}$,
with parameter $\epsilon > 0$, is sometimes used as a differentiable approximation of the absolute
value function $|u|$. To approximately solve the $\ell_1$-norm approximation problem

$$\mbox{minimize} \quad \|Ax - b\|_1, \qquad (6.26)$$

where $A \in \mathbf{R}^{m \times n}$, we solve instead the problem

$$\mbox{minimize} \quad \sum_{i=1}^m \phi(a_i^T x - b_i), \qquad (6.27)$$

where $a_i^T$ is the $i$th row of $A$. We assume $\mathbf{rank}\, A = n$.

Let $p^\star$ denote the optimal value of the $\ell_1$-norm approximation problem (6.26). Let $\hat{x}$
denote the optimal solution of the approximate problem (6.27), and let $\hat{r}$ denote the
associated residual, $\hat{r} = A\hat{x} - b$.

(a) Show that $p^\star \geq \sum_{i=1}^m \hat{r}_i^2 / (\hat{r}_i^2 + \epsilon)^{1/2}$.

(b) Show that

$$\|A\hat{x} - b\|_1 \leq p^\star + \sum_{i=1}^m |\hat{r}_i| \left( 1 - \frac{|\hat{r}_i|}{(\hat{r}_i^2 + \epsilon)^{1/2}} \right).$$

(By evaluating the righthand side after computing $\hat{x}$, we obtain a bound on how suboptimal
$\hat{x}$ is for the $\ell_1$-norm approximation problem.)

6.5  Minimum  length  approximation.  Consider  the  problem 

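$$\begin{array}{ll}
\mbox{minimize} & \mathbf{length}(x) \\
\mbox{subject to} & \|Ax - b\| \leq \epsilon,
\end{array}$$

where $\mathbf{length}(x) = \min\{k \mid x_i = 0 \mbox{ for } i > k\}$. The problem variable is $x \in \mathbf{R}^n$; the
problem parameters are $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$, and $\epsilon > 0$. In a regression context, we are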
asked  to  find  the  minimum  number  of  columns  of  A,  taken  in  order,  that  can  approximate 
the  vector  b  within  e. 

Show  that  this  is  a  quasiconvex  optimization  problem. 

6.6 Duals of some penalty function approximation problems. Derive a Lagrange dual for the
problem

$$\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^m \phi(r_i) \\
\mbox{subject to} & r = Ax - b,
\end{array}$$

for the following penalty functions $\phi : \mathbf{R} \to \mathbf{R}$. The variables are $x \in \mathbf{R}^n$, $r \in \mathbf{R}^m$.

(a) Deadzone-linear penalty (with deadzone width $a = 1$),

$$\phi(u) = \left\{ \begin{array}{ll} 0 & |u| \leq 1 \\ |u| - 1 & |u| > 1. \end{array} \right.$$

(b) Huber penalty (with $M = 1$),

$$\phi(u) = \left\{ \begin{array}{ll} u^2 & |u| \leq 1 \\ 2|u| - 1 & |u| > 1. \end{array} \right.$$

(c) Log-barrier (with limit $a = 1$),

$$\phi(u) = -\log(1 - u^2), \qquad \mathbf{dom}\, \phi = (-1, 1).$$

(d) Relative deviation from one,

$$\phi(u) = \max\{u, 1/u\} = \left\{ \begin{array}{ll} u & u \geq 1 \\ 1/u & u \leq 1, \end{array} \right.$$

with $\mathbf{dom}\, \phi = \mathbf{R}_{++}$.

Regularization and robust approximation

6.7 Bi-criterion optimization with Euclidean norms. We consider the bi-criterion optimization
problem

$$\mbox{minimize (w.r.t. } \mathbf{R}_+^2 \mbox{)} \quad (\|Ax - b\|_2^2, \; \|x\|_2^2),$$

where $A \in \mathbf{R}^{m \times n}$ has rank $r$, and $b \in \mathbf{R}^m$. Show how to find the solution of each of the
following problems from the singular value decomposition of $A$,

$$A = U \mathbf{diag}(\sigma) V^T = \sum_{i=1}^r \sigma_i u_i v_i^T$$

(see §A.5.4).

(a) Tikhonov regularization: minimize $\|Ax - b\|_2^2 + \delta \|x\|_2^2$.

(b) Minimize $\|Ax - b\|_2^2$ subject to $\|x\|_2^2 = \gamma$.

(c) Maximize $\|Ax - b\|_2^2$ subject to $\|x\|_2^2 = \gamma$.

Here $\delta$ and $\gamma$ are positive parameters.

Your  results  provide  efficient  methods  for  computing  the  optimal  trade-off  curve  and  the 
set  of  achievable  values  of  the  bi-criterion  problem. 

6.8 Formulate the following robust approximation problems as LPs, QPs, SOCPs, or SDPs.
For each subproblem, consider the $\ell_1$-, $\ell_2$-, and the $\ell_\infty$-norms.

(a) Stochastic robust approximation with a finite set of parameter values, i.e., the
sum-of-norms problem

$$\mbox{minimize} \quad \sum_{i=1}^k p_i \|A_i x - b\|$$

where $p \succeq 0$ and $\mathbf{1}^T p = 1$. (See §6.4.1.)

(b) Worst-case robust approximation with coefficient bounds:

$$\mbox{minimize} \quad \sup_{A \in \mathcal{A}} \|Ax - b\|$$

where

$$\mathcal{A} = \{A \in \mathbf{R}^{m \times n} \mid l_{ij} \leq a_{ij} \leq u_{ij}, \; i = 1, \ldots, m, \; j = 1, \ldots, n\}.$$

Here the uncertainty set is described by giving upper and lower bounds for the
components of $A$. We assume $l_{ij} < u_{ij}$.

(c) Worst-case robust approximation with polyhedral uncertainty:

$$\mbox{minimize} \quad \sup_{A \in \mathcal{A}} \|Ax - b\|$$

where

$$\mathcal{A} = \{[a_1 \; \cdots \; a_m]^T \mid C_i a_i \preceq d_i, \; i = 1, \ldots, m\}.$$

The uncertainty is described by giving a polyhedron $\mathcal{P}_i = \{a_i \mid C_i a_i \preceq d_i\}$ of possible
values for each row. The parameters $C_i \in \mathbf{R}^{p_i \times n}$, $d_i \in \mathbf{R}^{p_i}$, $i = 1, \ldots, m$, are given.
We assume that the polyhedra $\mathcal{P}_i$ are nonempty and bounded.
Function fitting and interpolation

6.9 Minimax rational function fitting. Show that the following problem is quasiconvex:

$$\mbox{minimize} \quad \max_{i=1,\ldots,k} \left| \frac{p(t_i)}{q(t_i)} - y_i \right|$$

where

$$p(t) = a_0 + a_1 t + a_2 t^2 + \cdots + a_m t^m, \qquad q(t) = 1 + b_1 t + \cdots + b_n t^n,$$

and the domain of the objective function is defined as

$$D = \{(a, b) \in \mathbf{R}^{m+1} \times \mathbf{R}^n \mid q(t) > 0, \; \alpha \leq t \leq \beta\}.$$

In this problem we fit a rational function $p(t)/q(t)$ to given data, while constraining the
denominator polynomial to be positive on the interval $[\alpha, \beta]$. The optimization variables
are the numerator and denominator coefficients $a_i$, $b_i$. The interpolation points $t_i \in [\alpha, \beta]$,
and desired function values $y_i$, $i = 1, \ldots, k$, are given.

6.10 Fitting data with a concave nonnegative nondecreasing quadratic function. We are given
the data

\[
x_1, \ldots, x_N \in \mathbf{R}^n, \qquad y_1, \ldots, y_N \in \mathbf{R},
\]

and wish to fit a quadratic function of the form

\[
f(x) = (1/2) x^T P x + q^T x + r,
\]

where $P \in \mathbf{S}^n$, $q \in \mathbf{R}^n$, and $r \in \mathbf{R}$ are the parameters in the model (and, therefore, the
variables in the fitting problem).

Our model will be used only on the box $\mathcal{B} = \{x \in \mathbf{R}^n \mid l \preceq x \preceq u\}$. You can assume that
$l \prec u$, and that the given data points $x_i$ are in this box.

We will use the simple sum of squared errors objective,

\[
\sum_{i=1}^N (f(x_i) - y_i)^2,
\]

as the criterion for the fit. We also impose several constraints on the function $f$. First,
it must be concave. Second, it must be nonnegative on $\mathcal{B}$, i.e., $f(z) \geq 0$ for all $z \in \mathcal{B}$.
Third, $f$ must be nondecreasing on $\mathcal{B}$, i.e., whenever $z, \tilde z \in \mathcal{B}$ satisfy $z \preceq \tilde z$, we have
$f(z) \leq f(\tilde z)$.

Show how to formulate this fitting problem as a convex problem. Simplify your formulation
as much as you can.

6.11 Least-squares direction interpolation. Suppose $F_1, \ldots, F_n : \mathbf{R}^k \to \mathbf{R}^p$, and we form the
linear combination $F : \mathbf{R}^k \to \mathbf{R}^p$,

\[
F(u) = x_1 F_1(u) + \cdots + x_n F_n(u),
\]

where $x$ is the variable in the interpolation problem.

In this problem we require that $\angle(F(v_j), q_j) = 0$, $j = 1, \ldots, m$, where $q_j$ are given vectors
in $\mathbf{R}^p$, which we assume satisfy $\|q_j\|_2 = 1$. In other words, we require the direction of
$F$ to take on specified values at the points $v_j$. To ensure that $F(v_j)$ is not zero (which
makes the angle undefined), we impose the minimum length constraints $\|F(v_j)\|_2 \geq \epsilon$,
$j = 1, \ldots, m$, where $\epsilon > 0$ is given.

Show how to find $x$ that minimizes $\|x\|_2$, and satisfies the direction (and minimum length)
conditions above, using convex optimization.

6.12 Interpolation with monotone functions. A function $f : \mathbf{R}^k \to \mathbf{R}$ is monotone nondecreasing
(with respect to $\mathbf{R}^k_+$) if $f(u) \geq f(v)$ whenever $u \succeq v$.

(a) Show that there exists a monotone nondecreasing function $f : \mathbf{R}^k \to \mathbf{R}$, that satisfies
$f(u_i) = y_i$ for $i = 1, \ldots, m$, if and only if

\[
y_i \geq y_j \mbox{ whenever } u_i \succeq u_j, \qquad i, j = 1, \ldots, m.
\]

(b) Show that there exists a convex monotone nondecreasing function $f : \mathbf{R}^k \to \mathbf{R}$, with
$\mathbf{dom}\, f = \mathbf{R}^k$, that satisfies $f(u_i) = y_i$ for $i = 1, \ldots, m$, if and only if there exist
$g_i \in \mathbf{R}^k$, $i = 1, \ldots, m$, such that

\[
g_i \succeq 0, \quad i = 1, \ldots, m, \qquad y_j \geq y_i + g_i^T (u_j - u_i), \quad i, j = 1, \ldots, m.
\]

6.13 Interpolation with quasiconvex functions. Show that there exists a quasiconvex function
$f : \mathbf{R}^k \to \mathbf{R}$, that satisfies $f(u_i) = y_i$ for $i = 1, \ldots, m$, if and only if there exist $g_i \in \mathbf{R}^k$,
$i = 1, \ldots, m$, such that

\[
g_i^T (u_j - u_i) \leq -1 \mbox{ whenever } y_j < y_i, \qquad i, j = 1, \ldots, m.
\]

6.14 [Nes00] Interpolation with positive-real functions. Suppose $z_1, \ldots, z_n \in \mathbf{C}$ are $n$ distinct
points with $|z_i| > 1$. We define $K_{\rm np}$ as the set of vectors $y \in \mathbf{C}^n$ for which there exists a
function $f : \mathbf{C} \to \mathbf{C}$ that satisfies the following conditions.

• $f$ is positive-real, which means it is analytic outside the unit circle (i.e., for $|z| > 1$),
and its real part is nonnegative outside the unit circle ($\Re f(z) \geq 0$ for $|z| > 1$).

• $f$ satisfies the interpolation conditions

\[
f(z_1) = y_1, \quad f(z_2) = y_2, \quad \ldots, \quad f(z_n) = y_n.
\]

If we denote the set of positive-real functions as $\mathcal{F}$, then we can express $K_{\rm np}$ as

\[
K_{\rm np} = \{y \in \mathbf{C}^n \mid \exists f \in \mathcal{F}, \; y_k = f(z_k), \; k = 1, \ldots, n\}.
\]

(a) It can be shown that $f$ is positive-real if and only if there exists a nondecreasing
function $\rho$ such that for all $z$ with $|z| > 1$,

\[
f(z) = i \Im f(\infty) + \int_0^{2\pi} \frac{e^{i\theta} + z^{-1}}{e^{i\theta} - z^{-1}} \, d\rho(\theta),
\]

where $i = \sqrt{-1}$ (see [KN77, page 389]). Use this representation to show that $K_{\rm np}$
is a closed convex cone.

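(b) We will use the inner product $\Re(x^H y)$ between vectors $x, y \in \mathbf{C}^n$, where $x^H$ denotes
the complex conjugate transpose of $x$. Show that the dual cone of $K_{\rm np}$ is given by

\[
K_{\rm np}^* = \left\{ x \in \mathbf{C}^n \;\middle|\; \Im(\mathbf{1}^T x) = 0, \;\;
\Re\left( \sum_{l=1}^n x_l \frac{e^{-i\theta} + \bar z_l^{-1}}{e^{-i\theta} - \bar z_l^{-1}} \right) \geq 0 \;\; \forall \theta \in [0, 2\pi] \right\}.
\]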

(c) Show that

\[
K_{\rm np}^* = \left\{ x \in \mathbf{C}^n \;\middle|\; \exists Q \in \mathbf{H}^n_+, \;\;
x_l = \sum_{k=1}^n \frac{Q_{kl}}{1 - z_k^{-1} \bar z_l^{-1}}, \;\; l = 1, \ldots, n \right\}
\]

where $\mathbf{H}^n_+$ denotes the set of positive semidefinite Hermitian matrices of size $n \times n$.
Use the following result (known as Riesz-Fejér theorem; see [KN77, page 60]). A
function of the form

\[
\sum_{k=0}^n (y_k e^{-ik\theta} + \bar y_k e^{ik\theta})
\]

is nonnegative for all $\theta$ if and only if there exist $a_0, \ldots, a_n \in \mathbf{C}$ such that

\[
\sum_{k=0}^n (y_k e^{-ik\theta} + \bar y_k e^{ik\theta}) = \left| \sum_{k=0}^n a_k e^{ik\theta} \right|^2.
\]

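(d) Show that $K_{\rm np} = \{y \in \mathbf{C}^n \mid P(y) \succeq 0\}$ where $P(y) \in \mathbf{H}^n$ is defined as

\[
P(y)_{kl} = \frac{y_k + \bar y_l}{1 - z_k^{-1} \bar z_l^{-1}}, \qquad l, k = 1, \ldots, n.
\]

The matrix $P(y)$ is called the Nevanlinna-Pick matrix associated with the points
$z_k$, $y_k$.

Hint. As we noted in part (a), $K_{\rm np}$ is a closed convex cone, so $K_{\rm np} = K_{\rm np}^{**}$.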

(e) As an application, pose the following problem as a convex optimization problem:

\[
\begin{array}{ll}
\mbox{minimize} & \sum_{k=1}^n |f(z_k) - w_k|^2 \\
\mbox{subject to} & f \in \mathcal{F}.
\end{array}
\]

The problem data are $n$ points $z_k$ with $|z_k| > 1$ and $n$ complex numbers $w_1, \ldots, w_n$.
We optimize over all positive-real functions $f$.


Chapter  7 

Statistical  estimation 


7.1  Parametric  distribution  estimation 

7.1.1  Maximum  likelihood  estimation 

We consider a family of probability distributions on $\mathbf{R}^m$, indexed by a vector
$x \in \mathbf{R}^n$, with densities $p_x(\cdot)$. When considered as a function of $x$, for fixed $y \in \mathbf{R}^m$,
the function $p_x(y)$ is called the likelihood function. It is more convenient to work
with its logarithm, which is called the log-likelihood function, and denoted $l$:

\[
l(x) = \log p_x(y).
\]

There are often constraints on the values of the parameter $x$, which can represent
prior knowledge about $x$, or the domain of the likelihood function. These
constraints can be explicitly given, or incorporated into the likelihood function by
assigning $p_x(y) = 0$ (for all $y$) whenever $x$ does not satisfy the prior information
constraints. (Thus, the log-likelihood function can be assigned the value $-\infty$ for
parameters $x$ that violate the prior information constraints.)

Now consider the problem of estimating the value of the parameter $x$, based
on observing one sample $y$ from the distribution. A widely used method, called
maximum likelihood (ML) estimation, is to estimate $x$ as

\[
\hat x_{\rm ml} = \mathop{\rm argmax}_x \, p_x(y) = \mathop{\rm argmax}_x \, l(x),
\]

i.e., to choose as our estimate a value of the parameter that maximizes the likelihood
(or log-likelihood) function for the observed value of $y$. If we have prior
information about $x$, such as $x \in C \subseteq \mathbf{R}^n$, we can add the constraint $x \in C$
explicitly, or impose it implicitly, by redefining $p_x(y)$ to be zero for $x \notin C$.

The problem of finding a maximum likelihood estimate of the parameter vector
$x$ can be expressed as

\[
\begin{array}{ll}
\mbox{maximize} & l(x) = \log p_x(y) \\
\mbox{subject to} & x \in C,
\end{array} \tag{7.1}
\]

where $x \in C$ gives the prior information or other constraints on the parameter
vector $x$. In this optimization problem, the vector $x \in \mathbf{R}^n$ (which is the parameter
in the probability density) is the variable, and the vector $y \in \mathbf{R}^m$ (which is the
observed sample) is a problem parameter.

The  maximum  likelihood  estimation  problem  (7.1)  is  a  convex  optimization 
problem  if  the  log-likelihood  function  l  is  concave  for  each  value  of  y,  and  the  set 
C  can  be  described  by  a  set  of  linear  equality  and  convex  inequality  constraints,  a 
situation  which  occurs  in  many  estimation  problems.  For  these  problems  we  can 
compute  an  ML  estimate  using  convex  optimization. 

Linear measurements with IID noise

We consider a linear measurement model,

\[
y_i = a_i^T x + v_i, \quad i = 1, \ldots, m,
\]

where $x \in \mathbf{R}^n$ is a vector of parameters to be estimated, $y_i \in \mathbf{R}$ are the measured
or observed quantities, and $v_i$ are the measurement errors or noise. We assume
that $v_i$ are independent, identically distributed (IID), with density $p$ on $\mathbf{R}$. The
likelihood function is then

\[
p_x(y) = \prod_{i=1}^m p(y_i - a_i^T x),
\]

so the log-likelihood function is

\[
l(x) = \log p_x(y) = \sum_{i=1}^m \log p(y_i - a_i^T x).
\]

The ML estimate is any optimal point for the problem

\[
\mbox{maximize} \quad \sum_{i=1}^m \log p(y_i - a_i^T x), \tag{7.2}
\]

with variable $x$. If the density $p$ is log-concave, this problem is convex, and has the
form of a penalty approximation problem ((6.2), page 294), with penalty function
$-\log p$.


Example 7.1 ML estimation for some common noise densities.

• Gaussian noise. When $v_i$ are Gaussian with zero mean and variance $\sigma^2$, the
density is $p(z) = (2\pi\sigma^2)^{-1/2} e^{-z^2/(2\sigma^2)}$, and the log-likelihood function is

\[
l(x) = -(m/2) \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|Ax - y\|_2^2,
\]

where $A$ is the matrix with rows $a_1^T, \ldots, a_m^T$. Therefore the ML estimate of
$x$ is $\hat x_{\rm ml} = \mathop{\rm argmin}_x \|Ax - y\|_2^2$, the solution of a least-squares approximation
problem.

• Laplacian noise. When $v_i$ are Laplacian, i.e., have density $p(z) = (1/(2a)) e^{-|z|/a}$
(where $a > 0$), the ML estimate is $\hat x = \mathop{\rm argmin}_x \|Ax - y\|_1$, the solution of the
$\ell_1$-norm approximation problem.

• Uniform noise. When $v_i$ are uniformly distributed on $[-a, a]$, we have $p(z) =
1/(2a)$ on $[-a, a]$, and an ML estimate is any $x$ satisfying $\|Ax - y\|_\infty \leq a$.
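To make the three cases of Example 7.1 concrete, the sketch below looks at the simplest special case, in which $x$ is a scalar and every $a_i = 1$, so the residuals are $x - y_i$. This special case, the data values, and the function names are assumptions chosen for illustration only. In this setting the Gaussian ML estimate is the sample mean, the Laplacian ML estimate is a sample median, and the uniform-noise ML estimates form an interval.

```python
import statistics

# Hypothetical measurements y_i = x + v_i (scalar x, all a_i = 1: an assumption).
y = [1.0, 2.0, 2.5, 3.0, 10.0]

# Gaussian noise: minimize sum_i (x - y_i)^2, which gives the sample mean.
x_gauss = statistics.mean(y)

# Laplacian noise: minimize sum_i |x - y_i|, which gives a sample median.
x_laplace = statistics.median(y)

# Uniform noise on [-a, a]: any x with max_i |x - y_i| <= a is an ML estimate,
# i.e. any x in [max(y) - a, min(y) + a] (the interval is empty if a is too small).
a = 5.0
ml_interval = (max(y) - a, min(y) + a)


def l1_cost(x):
    """l1-norm of the residual vector for the estimate x."""
    return sum(abs(x - yi) for yi in y)


def l2_cost(x):
    """Squared l2-norm of the residual vector for the estimate x."""
    return sum((x - yi) ** 2 for yi in y)
```

The single large measurement pulls the mean well away from the bulk of the data, while the median is unaffected by how large that measurement is.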
ML interpretation of penalty function approximation

Conversely, we can interpret any penalty function approximation problem

\[
\mbox{minimize} \quad \sum_{i=1}^m \phi(b_i - a_i^T x)
\]

as a maximum likelihood estimation problem, with noise density

\[
p(z) = \frac{e^{-\phi(z)}}{\int e^{-\phi(u)} \, du},
\]

and measurements $b$. This observation gives a statistical interpretation of the
penalty function approximation problem. Suppose, for example, that the penalty
function $\phi$ grows very rapidly for large values, which means that we attach a very
large cost or penalty to large residuals. The corresponding noise density function
$p$ will have very small tails, and the ML estimator will avoid (if possible) estimates
with any large residuals because these correspond to very unlikely events.

We can also understand the robustness of $\ell_1$-norm approximation to large errors
in terms of maximum likelihood estimation. We interpret $\ell_1$-norm approximation
as maximum likelihood estimation with a noise density that is Laplacian; $\ell_2$-norm
approximation is maximum likelihood estimation with a Gaussian noise density.
The Laplacian density has larger tails than the Gaussian, i.e., the probability of a
very large $v_i$ is far larger with a Laplacian than a Gaussian density. As a result,
the associated maximum likelihood method expects to see greater numbers of large
residuals.
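The difference in tail behavior can be checked numerically. The sketch below compares the two-sided tail probability prob(|v| > t) for a Gaussian and for a Laplacian density; the function names are hypothetical, and matching the two variances is an assumption made here so that the comparison is fair.

```python
import math


def gaussian_tail(t, sigma=1.0):
    """prob(|v| > t) for zero-mean Gaussian noise with standard deviation sigma."""
    return math.erfc(t / (sigma * math.sqrt(2.0)))


def laplacian_tail(t, a=1.0):
    """prob(|v| > t) for the Laplacian density (1/(2a)) exp(-|z|/a)."""
    return math.exp(-t / a)


# The Laplacian density above has variance 2*a^2, so a = 1/sqrt(2) gives unit
# variance, matching a Gaussian with sigma = 1 (an assumption for this sketch).
a_unit = 1.0 / math.sqrt(2.0)

for t in (1.0, 3.0, 5.0):
    print(t, gaussian_tail(t), laplacian_tail(t, a_unit))
```

For moderate $t$ the two probabilities are comparable, but for large $t$ the Laplacian tail exceeds the Gaussian tail by orders of magnitude, which is the sense in which the Laplacian model tolerates occasional large residuals.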

Counting  problems  with  Poisson  distribution 

In a wide variety of problems the random variable $y$ is nonnegative integer valued,
with a Poisson distribution with mean $\mu > 0$:

\[
\mathbf{prob}(y = k) = \frac{e^{-\mu} \mu^k}{k!}.
\]

Often $y$ represents the count or number of events (such as photon arrivals, traffic
accidents, etc.) of a Poisson process over some period of time.

In a simple statistical model, the mean $\mu$ is modeled as an affine function of a
vector $u \in \mathbf{R}^n$:

\[
\mu = a^T u + b.
\]

Here $u$ is called the vector of explanatory variables, and the vector $a \in \mathbf{R}^n$ and
number $b \in \mathbf{R}$ are called the model parameters. For example, if $y$ is the number
of traffic accidents in some region over some period, $u_1$ might be the total traffic
flow through the region during the period, $u_2$ the rainfall in the region during the
period, and so on.

We are given a number of observations which consist of pairs $(u_i, y_i)$, $i =
1, \ldots, m$, where $y_i$ is the observed value of $y$ for which the value of the explanatory
variable is $u_i \in \mathbf{R}^n$. Our job is to find a maximum likelihood estimate of the model
parameters $a \in \mathbf{R}^n$ and $b \in \mathbf{R}$ from these data.

The likelihood function has the form

\[
\prod_{i=1}^m \frac{(a^T u_i + b)^{y_i} \exp(-(a^T u_i + b))}{y_i!},
\]

so the log-likelihood function is

\[
l(a, b) = \sum_{i=1}^m \left( y_i \log(a^T u_i + b) - (a^T u_i + b) - \log(y_i!) \right).
\]

We can find an ML estimate of $a$ and $b$ by solving the convex optimization problem

\[
\mbox{maximize} \quad \sum_{i=1}^m \left( y_i \log(a^T u_i + b) - (a^T u_i + b) \right),
\]

where the variables are $a$ and $b$.
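As a rough illustration of this problem, the sketch below maximizes the Poisson log-likelihood for a scalar explanatory variable using Newton's method with a backtracking line search. The solver choice, the function name `poisson_ml`, and the data are all assumptions made for illustration; the text does not prescribe a method. The sketch also assumes every count $y_i$ is positive and the $u_i$ are not all equal, so that the Hessian is negative definite.

```python
import math


def poisson_ml(u, y, iters=50):
    """Newton's method with backtracking for
    maximize sum_i ( y_i*log(a*u_i + b) - (a*u_i + b) ), with scalar u_i."""

    def loglik(a, b):
        total = 0.0
        for ui, yi in zip(u, y):
            m = a * ui + b
            if m <= 0:
                return -math.inf              # outside the domain
            total += yi * math.log(m) - m
        return total

    a, b = 0.0, sum(y) / len(y)               # feasible start: constant mean
    for _ in range(iters):
        ga = gb = haa = hab = hbb = 0.0
        for ui, yi in zip(u, y):
            m = a * ui + b
            r = yi / m - 1.0                  # derivative of term i w.r.t. m
            w = yi / (m * m)                  # minus its second derivative
            ga += r * ui
            gb += r
            haa += w * ui * ui
            hab += w * ui
            hbb += w
        det = haa * hbb - hab * hab           # determinant of minus the Hessian
        da = (hbb * ga - hab * gb) / det      # Newton ascent direction
        db = (haa * gb - hab * ga) / det
        decrement = ga * da + gb * db         # nonnegative; zero at the optimum
        if decrement < 1e-12:
            break
        t, f0 = 1.0, loglik(a, b)
        while loglik(a + t * da, b + t * db) < f0 + 0.25 * t * decrement:
            t *= 0.5                          # backtrack (also keeps means positive)
        a, b = a + t * da, b + t * db
    return a, b


u = [1.0, 2.0, 3.0, 4.0, 5.0]                 # hypothetical explanatory values
y = [2, 3, 6, 7, 8]                           # hypothetical observed counts
a_ml, b_ml = poisson_ml(u, y)
```

At the optimum the gradient vanishes, which implies in particular that the fitted means $a u_i + b$ sum to the total observed count.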


Logistic  regression 

We consider a random variable $y \in \{0, 1\}$, with

\[
\mathbf{prob}(y = 1) = p, \qquad \mathbf{prob}(y = 0) = 1 - p,
\]

where $p \in [0, 1]$, and is assumed to depend on a vector of explanatory variables
$u \in \mathbf{R}^n$. For example, $y = 1$ might mean that an individual in a population acquires
a certain disease. The probability of acquiring the disease is $p$, which is modeled
as a function of some explanatory variables $u$, which might represent weight, age,
height, blood pressure, and other medically relevant variables.

The logistic model has the form

\[
p = \frac{\exp(a^T u + b)}{1 + \exp(a^T u + b)}, \tag{7.3}
\]

where $a \in \mathbf{R}^n$ and $b \in \mathbf{R}$ are the model parameters that determine how the
probability $p$ varies as a function of the explanatory variable $u$.

Now suppose we are given some data consisting of a set of values of the explanatory
variables $u_1, \ldots, u_m \in \mathbf{R}^n$ along with the corresponding outcomes $y_1, \ldots, y_m \in
\{0, 1\}$. Our job is to find a maximum likelihood estimate of the model parameters
$a \in \mathbf{R}^n$ and $b \in \mathbf{R}$. Finding an ML estimate of $a$ and $b$ is sometimes called logistic
regression.

We can re-order the data so that for $u_1, \ldots, u_q$, the outcome is $y = 1$, and for
$u_{q+1}, \ldots, u_m$ the outcome is $y = 0$. The likelihood function then has the form

\[
\prod_{i=1}^q p_i \prod_{i=q+1}^m (1 - p_i),
\]

where $p_i$ is given by the logistic model with explanatory variable $u_i$. The
log-likelihood function has the form

\[
\begin{aligned}
l(a, b) &= \sum_{i=1}^q \log p_i + \sum_{i=q+1}^m \log(1 - p_i) \\
&= \sum_{i=1}^q \log \frac{\exp(a^T u_i + b)}{1 + \exp(a^T u_i + b)} + \sum_{i=q+1}^m \log \frac{1}{1 + \exp(a^T u_i + b)} \\
&= \sum_{i=1}^q (a^T u_i + b) - \sum_{i=1}^m \log(1 + \exp(a^T u_i + b)).
\end{aligned}
\]

Since $l$ is a concave function of $a$ and $b$, the logistic regression problem can be solved
as a convex optimization problem. Figure 7.1 shows an example with $u \in \mathbf{R}$.

Figure 7.1 Logistic regression. The circles show 50 points $(u_i, y_i)$, where
$u_i \in \mathbf{R}$ is the explanatory variable, and $y_i \in \{0, 1\}$ is the outcome. The
data suggest that for $u < 5$ or so, the outcome is more likely to be $y = 0$,
while for $u > 5$ or so, the outcome is more likely to be $y = 1$. The data
also suggest that for $u < 2$ or so, the outcome is very likely to be $y = 0$,
and for $u > 8$ or so, the outcome is very likely to be $y = 1$. The solid
curve shows $\mathbf{prob}(y = 1) = \exp(au + b)/(1 + \exp(au + b))$ for the maximum
likelihood parameters $a$, $b$. This maximum likelihood model is consistent
with our informal observations about the data set.
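The following sketch fits the logistic model for a scalar explanatory variable by applying Newton's method with backtracking to the concave log-likelihood above. The solver, the function name `logistic_ml`, and the small data set are assumptions made for illustration (the data are not those of figure 7.1). The data are chosen so that the two outcomes overlap; if they were perfectly separable, the log-likelihood would have no maximizer.

```python
import math


def logistic_ml(u, y, iters=50):
    """Newton's method with backtracking for the logistic log-likelihood
    l(a, b) = sum_i y_i*(a*u_i + b) - sum_i log(1 + exp(a*u_i + b)), scalar u_i."""

    def log1pexp(z):
        """Numerically stable log(1 + e^z)."""
        return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

    def loglik(a, b):
        return sum(yi * (a * ui + b) - log1pexp(a * ui + b)
                   for ui, yi in zip(u, y))

    a = b = 0.0
    for _ in range(iters):
        ga = gb = haa = hab = hbb = 0.0
        for ui, yi in zip(u, y):
            p = 1.0 / (1.0 + math.exp(-(a * ui + b)))
            w = p * (1.0 - p)                 # weight in minus the Hessian
            ga += (yi - p) * ui
            gb += yi - p
            haa += w * ui * ui
            hab += w * ui
            hbb += w
        det = haa * hbb - hab * hab
        da = (hbb * ga - hab * gb) / det      # Newton ascent direction
        db = (haa * gb - hab * ga) / det
        decrement = ga * da + gb * db
        if decrement < 1e-12:
            break
        t, f0 = 1.0, loglik(a, b)
        while loglik(a + t * da, b + t * db) < f0 + 0.25 * t * decrement:
            t *= 0.5
        a, b = a + t * da, b + t * db
    return a, b


u = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # hypothetical explanatory values
y = [0, 0, 1, 0, 1, 0, 1, 1]                  # hypothetical outcomes (not separable)
a_ml, b_ml = logistic_ml(u, y)
```

Setting the gradient to zero shows that, at the ML parameters, the fitted probabilities sum to the number of observed outcomes with $y = 1$.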

Covariance  estimation  for  Gaussian  variables 

Suppose $y \in \mathbf{R}^n$ is a Gaussian random variable with zero mean and covariance
matrix $R = \mathbf{E}\, yy^T$, so its density is

\[
p_R(y) = (2\pi)^{-n/2} \det(R)^{-1/2} \exp(-y^T R^{-1} y / 2),
\]

where $R \in \mathbf{S}^n_{++}$. We want to estimate the covariance matrix $R$ based on $N$
independent samples $y_1, \ldots, y_N \in \mathbf{R}^n$ drawn from the distribution, and using prior
knowledge about $R$.

The log-likelihood function has the form

\[
\begin{aligned}
l(R) &= \log p_R(y_1, \ldots, y_N) \\
&= -(Nn/2) \log(2\pi) - (N/2) \log \det R - (1/2) \sum_{k=1}^N y_k^T R^{-1} y_k \\
&= -(Nn/2) \log(2\pi) - (N/2) \log \det R - (N/2) \mathbf{tr}(R^{-1} Y),
\end{aligned}
\]

where

\[
Y = \frac{1}{N} \sum_{k=1}^N y_k y_k^T
\]

is the sample covariance of $y_1, \ldots, y_N$. This log-likelihood function is not a concave
function of $R$ (although it is concave on a subset of its domain $\mathbf{S}^n_{++}$; see exercise 7.4),
but a change of variable yields a concave log-likelihood function. Let $S$ denote the
inverse of the covariance matrix, $S = R^{-1}$ (which is called the information matrix).
Using $S$ in place of $R$ as a new parameter, the log-likelihood function has the form

\[
l(S) = -(Nn/2) \log(2\pi) + (N/2) \log \det S - (N/2) \mathbf{tr}(SY),
\]

which is a concave function of $S$.

Therefore the ML estimate of $S$ (hence, $R$) is found by solving the problem

\[
\begin{array}{ll}
\mbox{maximize} & \log \det S - \mathbf{tr}(SY) \\
\mbox{subject to} & S \in \mathcal{S}
\end{array} \tag{7.4}
\]

where $\mathcal{S}$ is our prior knowledge of $S = R^{-1}$. (We also have the implicit constraint
that $S \in \mathbf{S}^n_{++}$.) Since the objective function is concave, this is a convex problem
if the set $\mathcal{S}$ can be described by a set of linear equality and convex inequality
constraints.

First we examine the case in which no prior assumptions are made on $R$ (hence,
$S$), other than $R \succ 0$. In this case the problem (7.4) can be solved analytically. The
gradient of the objective is $S^{-1} - Y$, so the optimal $S$ satisfies $S^{-1} = Y$ if $Y \in \mathbf{S}^n_{++}$.
(If $Y \notin \mathbf{S}^n_{++}$, the log-likelihood function is unbounded above.) Therefore, when
we have no prior assumptions about $R$, the maximum likelihood estimate of the
covariance is, simply, the sample covariance: $\hat R_{\rm ml} = Y$.
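The unconstrained case can be checked numerically on a small instance. The sketch below works with $n = 2$, forms the sample covariance $Y$ of a few hypothetical zero-mean samples, and evaluates the objective $\log \det S - \mathbf{tr}(SY)$ of problem (7.4) at $S = Y^{-1}$. The helper names and data are assumptions made for illustration, and the code assumes $Y$ is positive definite.

```python
import math


def sample_cov(samples):
    """Y = (1/N) sum_k y_k y_k^T for 2-vectors assumed to have zero mean."""
    N = len(samples)
    y11 = sum(s[0] * s[0] for s in samples) / N
    y12 = sum(s[0] * s[1] for s in samples) / N
    y22 = sum(s[1] * s[1] for s in samples) / N
    return [[y11, y12], [y12, y22]]


def inv2(M):
    """Inverse of a nonsingular 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]


def objective(S, Y):
    """log det S - tr(S Y) for symmetric positive definite 2x2 S."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    trace = S[0][0] * Y[0][0] + 2 * S[0][1] * Y[0][1] + S[1][1] * Y[1][1]
    return math.log(det) - trace


samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]   # hypothetical data
Y = sample_cov(samples)
S_ml = inv2(Y)                      # information matrix of the ML estimate R = Y
best = objective(S_ml, Y)           # equals -log det Y - n, with n = 2 here
```

Perturbing any entry of `S_ml` (while keeping it symmetric and positive definite) gives a strictly smaller objective value, consistent with $S^{-1} = Y$ being the optimality condition.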

Now we consider some examples of constraints on $R$ that can be expressed as
convex constraints on the information matrix $S$. We can handle lower and upper
(matrix) bounds on $R$, of the form

\[
L \preceq R \preceq U,
\]

where $L$ and $U$ are symmetric and positive definite, as

\[
U^{-1} \preceq R^{-1} \preceq L^{-1}.
\]

A condition number constraint on $R$,

\[
\lambda_{\max}(R) \leq \kappa_{\max} \lambda_{\min}(R),
\]

can be expressed as

\[
\lambda_{\max}(S) \leq \kappa_{\max} \lambda_{\min}(S).
\]

This is equivalent to the existence of $u > 0$ such that $uI \preceq S \preceq \kappa_{\max} u I$. We can
therefore solve the ML problem, with the condition number constraint on $R$, by
solving the convex problem

\[
\begin{array}{ll}
\mbox{maximize} & \log \det S - \mathbf{tr}(SY) \\
\mbox{subject to} & uI \preceq S \preceq \kappa_{\max} u I
\end{array} \tag{7.5}
\]

where the variables are $S \in \mathbf{S}^n$ and $u \in \mathbf{R}$.

As another example, suppose we are given bounds on the variance of some linear
functions of the underlying random vector $y$,

\[
\mathbf{E}(c_i^T y)^2 \leq \alpha_i, \quad i = 1, \ldots, K.
\]

These prior assumptions can be expressed as

\[
\mathbf{E}(c_i^T y)^2 = c_i^T R c_i = c_i^T S^{-1} c_i \leq \alpha_i, \quad i = 1, \ldots, K.
\]

Since $c_i^T S^{-1} c_i$ is a convex function of $S$ (provided $S \succ 0$, which holds here), these
bounds can be imposed in the ML problem.


7.1.2  Maximum  a  posteriori  probability  estimation 

Maximum a posteriori probability (MAP) estimation can be considered a Bayesian
version of maximum likelihood estimation, with a prior probability density on the
underlying parameter $x$. We assume that $x$ (the vector to be estimated) and $y$ (the
observation) are random variables with a joint probability density $p(x, y)$. This
is in contrast to the statistical estimation setup, where $x$ is a parameter, not a
random variable.

The prior density of $x$ is given by

\[
p_x(x) = \int p(x, y) \, dy.
\]

This density represents our prior information about what the values of the vector $x$
might be, before we observe the vector $y$. Similarly, the prior density of $y$ is given
by

\[
p_y(y) = \int p(x, y) \, dx.
\]

This density represents the prior information about what the measurement or
observation vector $y$ will be.

The conditional density of $y$, given $x$, is given by

\[
p_{y|x}(x, y) = \frac{p(x, y)}{p_x(x)}.
\]

In the MAP estimation method, $p_{y|x}$ plays the role of the parameter dependent
density $p_x$ in the maximum likelihood estimation setup. The conditional density
of $x$, given $y$, is given by

\[
p_{x|y}(x, y) = \frac{p(x, y)}{p_y(y)} = p_{y|x}(x, y) \frac{p_x(x)}{p_y(y)}.
\]

When we substitute the observed value $y$ into $p_{x|y}$, we obtain the posterior density
of $x$. It represents our knowledge of $x$ after the observation.

In the MAP estimation method, our estimate of $x$, given the observation $y$, is
given by

\[
\begin{aligned}
\hat x_{\rm map} &= \mathop{\rm argmax}_x \, p_{x|y}(x, y) \\
&= \mathop{\rm argmax}_x \, p_{y|x}(x, y) \, p_x(x) \\
&= \mathop{\rm argmax}_x \, p(x, y).
\end{aligned}
\]

In other words, we take as estimate of $x$ the value that maximizes the conditional
density of $x$, given the observed value of $y$. The only difference between this
estimate and the maximum likelihood estimate is the second term, $p_x(x)$, appearing
here. This term can be interpreted as taking our prior knowledge of $x$ into account.
Note that if the prior density of $x$ is uniform over a set $C$, then finding the MAP
estimate is the same as maximizing the likelihood function subject to $x \in C$, which
is the ML estimation problem (7.1).

Taking logarithms, we can express the MAP estimate as

\[
\hat x_{\rm map} = \mathop{\rm argmax}_x \left( \log p_{y|x}(x, y) + \log p_x(x) \right). \tag{7.6}
\]

The first term is essentially the same as the log-likelihood function; the second
term penalizes choices of $x$ that are unlikely, according to the prior density (i.e., $x$
with $p_x(x)$ small).

Brushing aside the philosophical differences in setup, the only difference between
finding the MAP estimate (via (7.6)) and the ML estimate (via (7.1)) is the presence
of an extra term in the optimization problem, associated with the prior density of
$x$. Therefore, for any maximum likelihood estimation problem with concave
log-likelihood function, we can add a prior density for $x$ that is log-concave, and the
resulting MAP estimation problem will be convex.

Linear measurements with IID noise

Suppose that $x \in \mathbf{R}^n$ and $y \in \mathbf{R}^m$ are related by

\[
y_i = a_i^T x + v_i, \quad i = 1, \ldots, m,
\]

where $v_i$ are IID with density $p_v$ on $\mathbf{R}$, and $x$ has prior density $p_x$ on $\mathbf{R}^n$. The
joint density of $x$ and $y$ is then

\[
p(x, y) = p_x(x) \prod_{i=1}^m p_v(y_i - a_i^T x),
\]

and the MAP estimate can be found by solving the optimization problem

\[
\mbox{maximize} \quad \log p_x(x) + \sum_{i=1}^m \log p_v(y_i - a_i^T x). \tag{7.7}
\]

If $p_x$ and $p_v$ are log-concave, this problem is convex. The only difference between
the MAP estimation problem (7.7) and the associated ML estimation problem (7.2)
is the extra term $\log p_x(x)$.

For example, if $v_i$ are uniform on $[-a, a]$, and the prior distribution of $x$ is
Gaussian with mean $\bar x$ and covariance $\Sigma$, the MAP estimate is found by solving
the QP

\[
\begin{array}{ll}
\mbox{minimize} & (x - \bar x)^T \Sigma^{-1} (x - \bar x) \\
\mbox{subject to} & \|Ax - y\|_\infty \leq a,
\end{array}
\]

with variable $x$.
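In the scalar special case with every $a_i = 1$ (an assumption made here to avoid needing a QP solver), this QP has a closed-form solution: the constraint $\|Ax - y\|_\infty \leq a$ restricts $x$ to an interval, and the MAP estimate is the projection of the prior mean $\bar x$ onto that interval, whatever the prior variance. The function name and data below are hypothetical.

```python
def map_scalar(y, a, x_bar):
    """MAP estimate for y_i = x + v_i with v_i uniform on [-a, a] and a Gaussian
    prior with mean x_bar: minimize (x - x_bar)^2 s.t. max_i |x - y_i| <= a."""
    lo, hi = max(y) - a, min(y) + a       # feasible interval for x
    if lo > hi:
        raise ValueError("measurements are inconsistent with the noise bound")
    return min(max(x_bar, lo), hi)        # project the prior mean onto [lo, hi]


y = [1.0, 1.4, 0.8]                       # hypothetical measurements
x_low = map_scalar(y, 0.5, 0.0)           # prior mean below the interval
x_mid = map_scalar(y, 0.5, 1.0)           # prior mean inside the interval
x_high = map_scalar(y, 0.5, 2.0)          # prior mean above the interval
```

When the prior mean is consistent with the measurements it is returned unchanged; otherwise the estimate is the feasible point closest to it.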

MAP  with  perfect  linear  measurements 

Suppose $x \in \mathbf{R}^n$ is a vector of parameters to be estimated, with prior density
$p_x$. We have $m$ perfect (noise free, deterministic) linear measurements, given by
$y = Ax$. In other words, the conditional distribution of $y$, given $x$, is a point mass
with value one at the point $Ax$. The MAP estimate can be found by solving the
problem

\[
\begin{array}{ll}
\mbox{maximize} & \log p_x(x) \\
\mbox{subject to} & Ax = y.
\end{array}
\]

If $p_x$ is log-concave, this is a convex problem.

If under the prior distribution, the parameters $x_i$ are IID with density $p$ on $\mathbf{R}$,
then the MAP estimation problem has the form

\[
\begin{array}{ll}
\mbox{maximize} & \sum_{i=1}^n \log p(x_i) \\
\mbox{subject to} & Ax = y,
\end{array}
\]

which is a least-penalty problem ((6.6), page 304), with penalty function $\phi(u) =
-\log p(u)$.

Conversely, we can interpret any least-penalty problem,

\[
\begin{array}{ll}
\mbox{minimize} & \phi(x_1) + \cdots + \phi(x_n) \\
\mbox{subject to} & Ax = b
\end{array}
\]

as a MAP estimation problem, with $m$ perfect linear measurements (i.e., $Ax = b$)
and $x_i$ IID with density

\[
p(z) = \frac{e^{-\phi(z)}}{\int e^{-\phi(u)} \, du}.
\]


7.2  Nonparametric  distribution  estimation 

We consider a random variable $X$ with values in the finite set $\{\alpha_1, \ldots, \alpha_n\} \subseteq \mathbf{R}$.
(We take the values to be in $\mathbf{R}$ for simplicity; the same ideas can be applied when
the values are in $\mathbf{R}^k$, for example.) The distribution of $X$ is characterized by
$p \in \mathbf{R}^n$, with $\mathbf{prob}(X = \alpha_k) = p_k$. Clearly, $p$ satisfies $p \succeq 0$, $\mathbf{1}^T p = 1$. Conversely,
if $p \in \mathbf{R}^n$ satisfies $p \succeq 0$, $\mathbf{1}^T p = 1$, then it defines a probability distribution for a
random variable $X$, defined as $\mathbf{prob}(X = \alpha_k) = p_k$. Thus, the probability simplex

\[
\{p \in \mathbf{R}^n \mid p \succeq 0, \; \mathbf{1}^T p = 1\}
\]

is in one-to-one correspondence with all possible probability distributions for a
random variable $X$ taking values in $\{\alpha_1, \ldots, \alpha_n\}$.

In  this  section  we  discuss  methods  used  to  estimate  the  distribution  p  based  on 
a  combination  of  prior  information  and,  possibly,  observations  and  measurements. 


Prior  information 


Many types of prior information about $p$ can be expressed in terms of linear equality
constraints or inequalities. If $f : \mathbf{R} \to \mathbf{R}$ is any function, then

\[
\mathbf{E}\, f(X) = \sum_{i=1}^n p_i f(\alpha_i)
\]

is a linear function of $p$. As a special case, if $C \subseteq \mathbf{R}$, then $\mathbf{prob}(X \in C)$ is a linear
function of $p$:

\[
\mathbf{prob}(X \in C) = c^T p, \qquad
c_i = \left\{ \begin{array}{ll} 1 & \alpha_i \in C \\ 0 & \alpha_i \notin C. \end{array} \right.
\]

It follows that known expected values of certain functions (e.g., moments) or known
probabilities of certain sets can be incorporated as linear equality constraints on
$p \in \mathbf{R}^n$. Inequalities on expected values or probabilities can be expressed as linear
inequalities on $p \in \mathbf{R}^n$.

For example, suppose we know that $X$ has mean $\mathbf{E}\, X = \alpha$, second moment
$\mathbf{E}\, X^2 = \beta$, and $\mathbf{prob}(X \geq 0) \leq 0.3$. This prior information can be expressed as

\[
\mathbf{E}\, X = \sum_{i=1}^n \alpha_i p_i = \alpha, \qquad
\mathbf{E}\, X^2 = \sum_{i=1}^n \alpha_i^2 p_i = \beta, \qquad
\sum_{\alpha_i \geq 0} p_i \leq 0.3,
\]

which are two linear equalities and one linear inequality in $p$.
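The sketch below evaluates these quantities for a small example, to make explicit that each one is an inner product of $p$ with a fixed vector. The support points, the distribution `p`, and the helper name `expect` are hypothetical values chosen for illustration.

```python
# Hypothetical support points alpha_i and a candidate distribution p.
alpha = [-1.0, -0.5, 0.0, 0.5, 1.0]
p = [0.4, 0.35, 0.05, 0.1, 0.1]


def expect(f, p):
    """E f(X) = sum_i p_i f(alpha_i), a linear function of p."""
    return sum(pi * f(ai) for pi, ai in zip(p, alpha))


mean = expect(lambda t: t, p)                            # E X
second_moment = expect(lambda t: t * t, p)               # E X^2
prob_nonneg = expect(lambda t: 1.0 if t >= 0 else 0.0, p)  # c^T p with c_i in {0, 1}

# The prior constraint prob(X >= 0) <= 0.3 is a linear inequality in p.
satisfies_prob_bound = prob_nonneg <= 0.3
```

Because `expect` is linear in `p`, its value at a mixture of two distributions is the same mixture of the two individual values, for any function `f`.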

We can also include some prior constraints that involve nonlinear functions of
$p$. As an example, the variance of $X$ is given by

\[
\mathbf{var}(X) = \mathbf{E}\, X^2 - (\mathbf{E}\, X)^2 = \sum_{i=1}^n \alpha_i^2 p_i - \left( \sum_{i=1}^n \alpha_i p_i \right)^2.
\]

The first term is a linear function of $p$ and the second term is concave quadratic
in $p$, so the variance of $X$ is a concave function of $p$. It follows that a lower bound
on the variance of $X$ can be expressed as a convex quadratic inequality on $p$.
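Concavity of the variance in $p$ can be spot-checked numerically: the variance of a mixture of two distributions is at least the corresponding mixture of their variances. The support points and distributions below are hypothetical, and a single check of this kind illustrates the inequality but of course does not prove it.

```python
alpha = [-1.0, 0.0, 2.0]                  # hypothetical support points


def variance(p):
    """var(X) = sum_i alpha_i^2 p_i - (sum_i alpha_i p_i)^2."""
    m1 = sum(pi * ai for pi, ai in zip(p, alpha))
    m2 = sum(pi * ai * ai for pi, ai in zip(p, alpha))
    return m2 - m1 * m1


p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
theta = 0.5
mix = [theta * pi + (1 - theta) * qi for pi, qi in zip(p, q)]

# Concavity: variance(mix) >= theta*variance(p) + (1 - theta)*variance(q).
gap = variance(mix) - (theta * variance(p) + (1 - theta) * variance(q))
```

The gap equals $\theta(1-\theta)$ times the squared difference of the two means, so it is nonnegative and vanishes only when the two distributions have the same mean.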

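As another example, suppose $A$ and $B$ are subsets of $\mathbf{R}$, and consider the
conditional probability of $A$ given $B$:

\[
\mathbf{prob}(X \in A \mid X \in B) = \frac{\mathbf{prob}(X \in A \cap B)}{\mathbf{prob}(X \in B)}.
\]

This function is linear-fractional in $p \in \mathbf{R}^n$: it can be expressed as

\[
\mathbf{prob}(X \in A \mid X \in B) = c^T p / d^T p,
\]

where

\[
c_i = \left\{ \begin{array}{ll} 1 & \alpha_i \in A \cap B \\ 0 & \alpha_i \notin A \cap B, \end{array} \right.
\qquad
d_i = \left\{ \begin{array}{ll} 1 & \alpha_i \in B \\ 0 & \alpha_i \notin B. \end{array} \right.
\]

Therefore we can express the prior constraints

\[
l \leq \mathbf{prob}(X \in A \mid X \in B) \leq u
\]

as the linear inequality constraints on $p$

\[
l d^T p \leq c^T p \leq u d^T p.
\]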
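A small numerical instance of this reformulation is sketched below. The support points, the sets $A$ and $B$, and the distribution are hypothetical; the point is only that the bounds on the conditional probability are tested through inequalities that are linear in $p$, with no division.

```python
alpha = [-2, -1, 0, 1, 2]                 # hypothetical support points
p = [0.1, 0.2, 0.3, 0.3, 0.1]             # hypothetical distribution


def in_A(t):
    return t >= 1                         # A = [1, inf), chosen for illustration


def in_B(t):
    return t >= 0                         # B = [0, inf), chosen for illustration


c = [1.0 if in_A(t) and in_B(t) else 0.0 for t in alpha]
d = [1.0 if in_B(t) else 0.0 for t in alpha]
cTp = sum(ci * pi for ci, pi in zip(c, p))    # prob(X in A and B)
dTp = sum(di * pi for di, pi in zip(d, p))    # prob(X in B)


def satisfies(l, u):
    """Linear form of l <= prob(X in A | X in B) <= u, using d^T p >= 0."""
    return l * dTp <= cTp <= u * dTp
```

Here the conditional probability is $0.4/0.7$, so `satisfies(0.5, 0.6)` holds while `satisfies(0.6, 0.9)` does not.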

Several other types of prior information can be expressed in terms of nonlinear
convex inequalities. For example, the entropy of $X$, given by

\[
-\sum_{i=1}^n p_i \log p_i,
\]

is a concave function of $p$, so we can impose a minimum value of entropy as a convex
inequality on $p$. If $q$ represents another distribution, i.e., $q \succeq 0$, $\mathbf{1}^T q = 1$, then
the Kullback-Leibler divergence between the distribution $q$ and the distribution $p$
is given by

\[
\sum_{i=1}^n p_i \log(p_i / q_i),
\]

which is convex in $p$ (and $q$ as well; see example 3.19, page 90). It follows that
we can impose a maximum Kullback-Leibler divergence between $p$ and a given
distribution $q$, as a convex inequality on $p$.

In the next few paragraphs we express the prior information about the distribution
$p$ as $p \in \mathcal{P}$. We assume that $\mathcal{P}$ can be described by a set of linear equalities and
convex inequalities. We include in the prior information $\mathcal{P}$ the basic constraints
$p \succeq 0$, $\mathbf{1}^T p = 1$.

Bounding  probabilities  and  expected  values 

Given prior information about the distribution, say $p \in \mathcal{P}$, we can compute upper
or lower bounds on the expected value of a function, or probability of a set. For
example, to determine a lower bound on $\mathbf{E}\, f(X)$ over all distributions that satisfy
the prior information $p \in \mathcal{P}$, we solve the convex problem

\[
\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^n f(\alpha_i) p_i \\
\mbox{subject to} & p \in \mathcal{P}.
\end{array}
\]

Maximum  likelihood  estimation 

We can use maximum likelihood estimation to estimate $p$ based on observations
from the distribution. Suppose we observe $N$ independent samples $x_1, \ldots, x_N$ from
the distribution. Let $k_i$ denote the number of these samples with value $\alpha_i$, so that
$k_1 + \cdots + k_n = N$, the total number of observed samples. The log-likelihood
function is then

\[
l(p) = \sum_{i=1}^n k_i \log p_i,
\]

which is a concave function of $p$. The maximum likelihood estimate of $p$ can be
found by solving the convex problem

\[
\begin{array}{ll}
\mbox{maximize} & l(p) = \sum_{i=1}^n k_i \log p_i \\
\mbox{subject to} & p \in \mathcal{P},
\end{array}
\]

with variable $p$.
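In the special case where the only prior information is the basic constraints $p \succeq 0$, $\mathbf{1}^T p = 1$ (an assumption made for this sketch; the general problem needs a convex solver), the maximizer is known in closed form: the empirical frequencies $p_i = k_i/N$. The counts and helper name below are hypothetical.

```python
import math


def log_likelihood(k, p):
    """l(p) = sum_i k_i log p_i (terms with k_i = 0 contribute nothing)."""
    return sum(ki * math.log(pi) for ki, pi in zip(k, p) if ki > 0)


k = [5, 3, 2]                             # hypothetical counts, N = 10
N = sum(k)
p_ml = [ki / N for ki in k]               # empirical frequencies

# A few other points of the probability simplex, for comparison.
others = [[1 / 3, 1 / 3, 1 / 3], [0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]
```

Each distribution in `others` attains a strictly smaller log-likelihood than `p_ml`, as the closed-form solution predicts.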

Maximum  entropy 

The  maximum  entropy  distribution  consistent  with  the  prior  assumptions  can  be 
found  by  solving  the  convex  problem 

$$\begin{array}{ll}
\mbox{minimize}   & \sum_{i=1}^n p_i \log p_i \\
\mbox{subject to} & p \in \mathcal{P}.
\end{array}$$

Enthusiasts  describe  the  maximum  entropy  distribution  as  the  most  equivocal  or 
most  random,  among  those  consistent  with  the  prior  information. 

Minimum  Kullback-Leibler  divergence 

We  can  find  the  distribution  p  that  has  minimum  Kullback-Leibler  divergence  from 
a  given  prior  distribution  q,  among  those  consistent  with  prior  information,  by 
solving  the  convex  problem 

$$\begin{array}{ll}
\mbox{minimize}   & \sum_{i=1}^n p_i \log(p_i/q_i) \\
\mbox{subject to} & p \in \mathcal{P}.
\end{array}$$

Note that when the prior distribution is the uniform distribution, i.e., $q = (1/n)\mathbf{1}$,
this problem reduces to the maximum entropy problem.
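
To see the reduction, substitute $q_i = 1/n$ into the objective and use $\mathbf{1}^T p = 1$:

```latex
\sum_{i=1}^n p_i \log\frac{p_i}{1/n}
  = \sum_{i=1}^n p_i \log p_i + \log n \sum_{i=1}^n p_i
  = \sum_{i=1}^n p_i \log p_i + \log n .
```

The two objectives differ only by the constant $\log n$, so they have the same minimizers over the same feasible set.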


Example 7.2 We consider a probability distribution on 100 equidistant points $\alpha_i$ in
the interval $[-1, 1]$. We impose the following prior assumptions:

$$\begin{array}{rcl}
\mathbf{E} X             & \in & [-0.1, 0.1] \\
\mathbf{E} X^2           & \in & [0.5, 0.6] \\
\mathbf{E}(3X^3 - 2X)    & \in & [-0.3, -0.2] \\
\mathbf{prob}(X < 0)     & \in & [0.3, 0.4].
\end{array} \qquad (7.8)$$

Along with the constraints $\mathbf{1}^T p = 1$, $p \succeq 0$, these constraints describe a polyhedron
of probability distributions.

Figure  7.2  shows  the  maximum  entropy  distribution  that  satisfies  these  constraints. 
The  maximum  entropy  distribution  satisfies 

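$$\begin{array}{rcl}
\mathbf{E} X           & = & 0.056 \\
\mathbf{E} X^2         & = & 0.5 \\
\mathbf{E}(3X^3 - 2X)  & = & -0.2 \\
\mathbf{prob}(X < 0)   & = & 0.4.
\end{array}$$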


To illustrate bounding probabilities, we compute upper and lower bounds on the
cumulative distribution $\mathbf{prob}(X \leq \alpha_i)$, for $i = 1, \ldots, 100$. For each value of $i$,
we solve two LPs: one that maximizes $\mathbf{prob}(X \leq \alpha_i)$, and one that minimizes
$\mathbf{prob}(X \leq \alpha_i)$, over all distributions consistent with the prior assumptions (7.8).
The results are shown in figure 7.3. The upper and lower curves show the upper and
lower bounds, respectively; the middle curve shows the cumulative distribution of the
maximum entropy distribution.

Figure 7.2 Maximum entropy distribution that satisfies the constraints (7.8).


Example 7.3 Bounding risk probability with known marginal distributions. Suppose $X$
and $Y$ are two random variables that give the return on two investments. We assume
that $X$ takes values in $\{\alpha_1, \ldots, \alpha_n\} \subseteq \mathbf{R}$ and $Y$ takes values in $\{\beta_1, \ldots, \beta_m\} \subseteq \mathbf{R}$,
with $p_{ij} = \mathbf{prob}(X = \alpha_i, Y = \beta_j)$. The marginal distributions of the two returns $X$
and $Y$ are known, i.e.,

$$\sum_{j=1}^m p_{ij} = r_i, \quad i = 1, \ldots, n, \qquad
  \sum_{i=1}^n p_{ij} = q_j, \quad j = 1, \ldots, m, \qquad (7.9)$$

but otherwise nothing is known about the joint distribution $p$. This defines a
polyhedron of joint distributions consistent with the given marginals.

Now suppose we make both investments, so our total return is the random variable
$X + Y$. We are interested in computing an upper bound on the probability of some
level of loss, or low return, i.e., $\mathbf{prob}(X + Y \leq \gamma)$. We can compute a tight upper
bound on this probability by solving the LP

$$\begin{array}{ll}
\mbox{maximize}   & \sum \{ p_{ij} \mid \alpha_i + \beta_j \leq \gamma \} \\
\mbox{subject to} & (7.9), \quad p_{ij} \geq 0, \quad i = 1, \ldots, n, \quad j = 1, \ldots, m.
\end{array}$$

The optimal value of this LP is the maximum probability of loss. The optimal
solution $p^\star$ is the joint distribution, consistent with the given marginal distributions,
that maximizes the probability of the loss.
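
For the smallest nontrivial case, $n = m = 2$, this LP can be solved by hand, which gives a quick sanity check. With both marginals fixed, the joint distribution has a single free parameter $s = p_{11}$ ranging over an interval, and a linear objective attains its maximum at an endpoint. The sketch below is only an illustration under these assumptions (the function name and data are hypothetical, and the loss event is taken as $X + Y \leq \gamma$); the general case requires an LP solver.

```python
def max_loss_probability_2x2(alpha, beta, r1, q1, gamma):
    """Tight upper bound on prob(X + Y <= gamma) for two-valued X and Y.

    alpha, beta: the two values of X and of Y.
    r1 = prob(X = alpha[0]), q1 = prob(Y = beta[0]).
    With both marginals fixed, the joint distribution has one free parameter
    s = p_11, so the LP is one-dimensional and is solved at an endpoint.
    """
    lo = max(0.0, r1 + q1 - 1.0)   # smallest feasible p_11
    hi = min(r1, q1)               # largest feasible p_11
    best = None
    for s in (lo, hi):
        p = [[s, r1 - s], [q1 - s, 1.0 - r1 - q1 + s]]
        loss = sum(p[i][j] for i in range(2) for j in range(2)
                   if alpha[i] + beta[j] <= gamma)
        if best is None or loss > best[0]:
            best = (loss, p)
    return best

# Hypothetical returns of -1 or +1 on each investment.
bound, joint = max_loss_probability_2x2([-1, 1], [-1, 1], 0.3, 0.2, 0.0)
print(bound)   # about 0.5; independent returns would give 0.44
```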

The same method can be applied to a derivative of the two investments. Let $R(X, Y)$
be the return of the derivative, where $R : \mathbf{R}^2 \to \mathbf{R}$. We can compute sharp lower
and upper bounds on $\mathbf{prob}(R \leq \gamma)$ by solving a similar LP, with objective function

$$\sum \{ p_{ij} \mid R(\alpha_i, \beta_j) \leq \gamma \},$$

which we can minimize and maximize.

Figure 7.3 The top and bottom curves show the maximum and minimum
possible values of the cumulative distribution function, $\mathbf{prob}(X \leq \alpha_i)$, over
all distributions that satisfy (7.8). The middle curve is the cumulative
distribution of the maximum entropy distribution that satisfies (7.8).


7.3  Optimal  detector  design  and  hypothesis  testing 

Suppose $X$ is a random variable with values in $\{1, \ldots, n\}$, with a distribution that
depends on a parameter $\theta \in \{1, \ldots, m\}$. The distributions of $X$, for the $m$ possible
values of $\theta$, can be represented by a matrix $P \in \mathbf{R}^{n \times m}$, with elements

$$p_{kj} = \mathbf{prob}(X = k \mid \theta = j).$$

The $j$th column of $P$ gives the probability distribution associated with the
parameter value $\theta = j$.

We consider the problem of estimating $\theta$, based on an observed sample of $X$. In
other words, the sample $X$ is generated from one of the $m$ possible distributions,
and we are to guess which one. The $m$ values of $\theta$ are called hypotheses, and guessing
which hypothesis is correct (i.e., which distribution generated the observed sample
$X$) is called hypothesis testing. In many cases one of the hypotheses corresponds
to some normal situation, and each of the other hypotheses corresponds to some
abnormal event. In this case hypothesis testing can be interpreted as observing a
value of $X$, and then guessing whether or not an abnormal event has occurred, and
if so, which one. For this reason hypothesis testing is also called detection.

In most cases there is no significance to the ordering of the hypotheses; they are
simply $m$ different hypotheses, arbitrarily labeled $\theta = 1, \ldots, m$. If $\hat\theta = \theta$, where $\hat\theta$
denotes the estimate of $\theta$, then we have correctly guessed the parameter value $\theta$. If
$\hat\theta \neq \theta$, then we have (incorrectly) guessed the parameter value $\theta$; we have mistaken
$\hat\theta$ for $\theta$. In other cases, there is significance in the ordering of the hypotheses. In this
case, an event such as $\hat\theta > \theta$, i.e., the event that we overestimate $\theta$, is meaningful.

It is also possible to parametrize $\theta$ by values other than $\{1, \ldots, m\}$, say as
$\theta \in \{\theta_1, \ldots, \theta_m\}$, where $\theta_i$ are (distinct) values. These values could be real numbers, or
vectors, for example, specifying the mean and variance of the $k$th distribution. In
this case, a quantity such as $\|\hat\theta - \theta\|$, which is the norm of the parameter estimation
error, is meaningful.


7.3.1  Deterministic  and  randomized  detectors 

A (deterministic) estimator or detector is a function $\psi$ from $\{1, \ldots, n\}$ (the set of
possible observed values) into $\{1, \ldots, m\}$ (the set of hypotheses). If $X$ is observed
to have value $k$, then our guess for the value of $\theta$ is $\hat\theta = \psi(k)$. One obvious
deterministic detector is the maximum likelihood detector, given by

$$\hat\theta = \psi_{\mathrm{ml}}(k) = \mathop{\rm argmax}_j \; p_{kj}. \qquad (7.10)$$

When we observe the value $X = k$, the maximum likelihood estimate of $\theta$ is a
value that maximizes the probability of observing $X = k$, over the set of possible
distributions.
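
The maximum likelihood detector amounts to a row-wise argmax of $P$. A minimal sketch follows, with a hypothetical matrix `P`, 0-based indices, and ties broken toward the smallest index (the text does not specify a tie-breaking rule).

```python
def ml_detector(P):
    """Return psi_ml as a list: entry k is the guess (0-based) for observation k.

    P[k][j] = prob(X = k | theta = j); ties are broken toward the smallest j.
    """
    return [max(range(len(row)), key=lambda j: row[j]) for row in P]

# Hypothetical 3-outcome, 2-hypothesis example (columns sum to one).
P = [[0.6, 0.1],
     [0.3, 0.3],
     [0.1, 0.6]]
print(ml_detector(P))   # [0, 0, 1]
```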

We will consider a generalization of the deterministic detector, in which the
estimate of $\theta$, given an observed value of $X$, is random. A randomized detector
of $\theta$ is a random variable $\hat\theta \in \{1, \ldots, m\}$, with a distribution that depends on the
observed value of $X$. A randomized detector can be defined in terms of a matrix
$T \in \mathbf{R}^{m \times n}$ with elements

$$t_{ik} = \mathbf{prob}(\hat\theta = i \mid X = k).$$

The interpretation is as follows: if we observe $X = k$, then the detector gives $\hat\theta = i$
with probability $t_{ik}$. The $k$th column of $T$, which we will denote $t_k$, gives the
probability distribution of $\hat\theta$, when we observe $X = k$. If each column of $T$ is a
unit vector, then the randomized detector is a deterministic detector, i.e., $\hat\theta$ is a
(deterministic) function of the observed value of $X$.

At first glance, it seems that intentionally introducing additional randomization
into the estimation or detection process can only make the estimator worse.
But we will see examples below in which a randomized detector outperforms all
deterministic estimators.

We are interested in designing the matrix $T$ that defines the randomized detector.
Obviously the columns $t_k$ of $T$ must satisfy the (linear equality and inequality)
constraints

$$t_k \succeq 0, \qquad \mathbf{1}^T t_k = 1. \qquad (7.11)$$




7.3.2  Detection  probability  matrix 

For the randomized detector defined by the matrix $T$, we define the detection
probability matrix as $D = TP$. We have

$$D_{ij} = (TP)_{ij} = \mathbf{prob}(\hat\theta = i \mid \theta = j),$$

so $D_{ij}$ is the probability of guessing $\hat\theta = i$, when in fact $\theta = j$. The $m \times m$
detection probability matrix $D$ characterizes the performance of the randomized
detector defined by $T$. The diagonal entry $D_{ii}$ is the probability of guessing $\hat\theta = i$
when $\theta = i$, i.e., the probability of correctly detecting that $\theta = i$. The off-diagonal
entry $D_{ij}$ (with $i \neq j$) is the probability of mistaking $\theta = i$ for $\theta = j$, i.e., the
probability that our guess is $\hat\theta = i$, when in fact $\theta = j$. If $D = I$, the detector is
perfect: no matter what the parameter $\theta$ is, we correctly guess $\hat\theta = \theta$.

The diagonal entries of $D$, arranged in a vector, are called the detection
probabilities, and denoted $P^{\mathrm{d}}$:

$$P^{\mathrm{d}}_i = D_{ii} = \mathbf{prob}(\hat\theta = i \mid \theta = i).$$

The error probabilities are the complements, and are denoted $P^{\mathrm{e}}$:

$$P^{\mathrm{e}}_i = 1 - D_{ii} = \mathbf{prob}(\hat\theta \neq i \mid \theta = i).$$

Since the columns of the detection probability matrix $D$ add up to one, we can
express the error probabilities as

$$P^{\mathrm{e}}_i = \sum_{j \neq i} D_{ji}.$$
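
These definitions are easy to check numerically. The sketch below forms $D = TP$ for a hypothetical randomized detector, using exact rational arithmetic, and computes the detection and error probabilities; all names and data are illustrative assumptions.

```python
from fractions import Fraction as F

def detection_matrix(T, P):
    """D = T P, where T is m x n and P is n x m (lists of rows)."""
    m, n = len(T), len(P)
    return [[sum(T[i][k] * P[k][j] for k in range(n)) for j in range(m)]
            for i in range(m)]

# Hypothetical data: n = 3 outcomes, m = 2 hypotheses.
P = [[F(6, 10), F(1, 10)],
     [F(3, 10), F(3, 10)],
     [F(1, 10), F(6, 10)]]
T = [[1, F(1, 2), 0],      # columns of T are distributions over the guesses
     [0, F(1, 2), 1]]

D = detection_matrix(T, P)
P_d = [D[i][i] for i in range(2)]                  # detection probabilities
P_e = [sum(D[j][i] for j in range(2) if j != i)    # error probabilities
       for i in range(2)]
print(P_d, P_e)   # each detection probability is 3/4, each error probability 1/4
```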


7.3.3  Optimal  detector  design 

In  this  section  we  show  that  a  wide  variety  of  objectives  for  detector  design  are 
linear,  affine,  or  convex  piecewise-linear  functions  of  D,  and  therefore  also  of  T 
(which  is  the  optimization  variable).  Similarly,  a  variety  of  constraints  for  detector 
design  can  be  expressed  in  terms  of  linear  inequalities  in  D.  It  follows  that  a  wide 
variety  of  optimal  detector  design  problems  can  be  expressed  as  LPs.  We  will  see 
in  §7.3.4  that  some  of  these  LPs  have  simple  solutions;  in  this  section  we  simply 
formulate  the  problem. 

Limits  on  errors  and  detection  probabilities 

We can impose a lower bound on the probability of correctly detecting the $j$th
hypothesis,

$$P^{\mathrm{d}}_j = D_{jj} \geq L_j,$$

which is a linear inequality in $D$ (hence, $T$). Similarly, we can impose a maximum
allowable probability for mistaking $\theta = i$ for $\theta = j$:

$$D_{ij} \leq U_{ij},$$

which are also linear constraints on $T$. We can take any of the detection
probabilities as an objective to be maximized, or any of the error probabilities as an
objective to be minimized.

Minimax  detector  design 

We can take as objective (to be minimized) the minimax error probability, $\max_j P^{\mathrm{e}}_j$,
which is a piecewise-linear convex function of $D$ (hence, also of $T$). With this as
the only objective, we have the problem of minimizing the maximum probability
of detection error,

$$\begin{array}{ll}
\mbox{minimize}   & \max_j P^{\mathrm{e}}_j \\
\mbox{subject to} & t_k \succeq 0, \quad \mathbf{1}^T t_k = 1, \quad k = 1, \ldots, n,
\end{array}$$

where the variables are $t_1, \ldots, t_n \in \mathbf{R}^m$. This can be reformulated as an LP. The
minimax detector minimizes the worst-case (largest) probability of error over all $m$
hypotheses.

We can, of course, add further constraints to the minimax detector design
problem.
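
One standard way to carry out the LP reformulation of the minimax problem is the epigraph form. The sketch below introduces an auxiliary scalar variable $s$ (our notation, not used elsewhere in the text) that upper bounds every error probability $1 - (TP)_{jj}$:

```latex
\begin{array}{ll}
\mbox{minimize}   & s \\
\mbox{subject to} & 1 - (TP)_{jj} \leq s, \quad j = 1, \ldots, m \\
                  & t_k \succeq 0, \quad \mathbf{1}^T t_k = 1, \quad k = 1, \ldots, n,
\end{array}
```

with variables $s \in \mathbf{R}$ and $t_1, \ldots, t_n \in \mathbf{R}^m$. Each constraint is linear in $(s, T)$ because $(TP)_{jj}$ is a linear function of $T$.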

Bayes  detector  design 

In Bayes detector design, we have a prior distribution for the hypotheses, given by
$q \in \mathbf{R}^m$, where

$$q_i = \mathbf{prob}(\theta = i).$$

In this case, the probabilities $p_{ij}$ are interpreted as conditional probabilities of $X$,
given $\theta$. The probability of error for the detector is then given by $q^T P^{\mathrm{e}}$, which is
an affine function of $T$. The Bayes optimal detector is the solution of the LP

$$\begin{array}{ll}
\mbox{minimize}   & q^T P^{\mathrm{e}} \\
\mbox{subject to} & t_k \succeq 0, \quad \mathbf{1}^T t_k = 1, \quad k = 1, \ldots, n.
\end{array}$$

We will see in §7.3.4 that this problem has a simple analytical solution.

One special case is when $q = (1/m)\mathbf{1}$. In this case the Bayes optimal detector
minimizes the average probability of error, where the (unweighted) average is over
the hypotheses. In §7.3.4 we will see that the maximum likelihood detector (7.10)
is optimal for this problem.

Bias,  mean-square  error,  and  other  quantities 

In this section we assume that the ordering of the values of $\theta$ has some significance,
i.e., that the value $\theta = i$ can be interpreted as a larger value of the parameter than
$\theta = j$, when $i > j$. This might be the case, for example, when $\theta = i$ corresponds to
the hypothesis that $i$ events have occurred. Here we may be interested in quantities
such as

$$\mathbf{prob}(\hat\theta > \theta \mid \theta = i),$$

which is the probability that we overestimate $\theta$ when $\theta = i$. This is an affine
function of $D$:

$$\mathbf{prob}(\hat\theta > \theta \mid \theta = i) = \sum_{j > i} D_{ji},$$

so a maximum allowable value for this probability can be expressed as a linear
inequality on $D$ (hence, $T$). As another example, the probability of misclassifying
$\theta$ by more than one, when $\theta = i$,

$$\mathbf{prob}(|\hat\theta - \theta| > 1 \mid \theta = i) = \sum_{|j - i| > 1} D_{ji},$$

is also a linear function of $D$.

We now suppose that the parameters have values $\{\theta_1, \ldots, \theta_m\} \subseteq \mathbf{R}$. The
estimation or detection (parameter) error is then given by $\hat\theta - \theta$, and a number of
quantities of interest are given by linear functions of $D$. Examples include:

• Bias. The bias of the detector, when $\theta = \theta_i$, is given by the linear function

$$\mathbf{E}_i(\hat\theta - \theta) = \sum_{j=1}^m (\theta_j - \theta_i) D_{ji},$$

where the subscript on $\mathbf{E}$ means the expectation is with respect to the
distribution of the hypothesis $\theta = \theta_i$.

• Mean square error. The mean square error of the detector, when $\theta = \theta_i$, is
given by the linear function

$$\mathbf{E}_i(\hat\theta - \theta)^2 = \sum_{j=1}^m (\theta_j - \theta_i)^2 D_{ji}.$$

• Average absolute error. The average absolute error of the detector, when
$\theta = \theta_i$, is given by the linear function

$$\mathbf{E}_i|\hat\theta - \theta| = \sum_{j=1}^m |\theta_j - \theta_i| D_{ji}.$$
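
The three quantities above are weighted sums over the $i$th column of $D$, which the following sketch makes explicit. The parameter values and the detection probability matrix are hypothetical, and indices are 0-based.

```python
def error_statistics(D, theta, i):
    """Bias, mean square error, and average absolute error when theta = theta[i].

    D[j][i] = prob(guess theta[j] | true value theta[i]); each is linear in D.
    """
    m = len(theta)
    bias = sum((theta[j] - theta[i]) * D[j][i] for j in range(m))
    mse = sum((theta[j] - theta[i]) ** 2 * D[j][i] for j in range(m))
    mae = sum(abs(theta[j] - theta[i]) * D[j][i] for j in range(m))
    return bias, mse, mae

# Hypothetical detection probability matrix (columns sum to one).
theta = [0.0, 1.0, 2.0]
D = [[0.5, 0.25, 0.0],
     [0.5, 0.5, 0.25],
     [0.0, 0.25, 0.75]]
print(error_statistics(D, theta, 0))   # (0.5, 0.5, 0.5)
print(error_statistics(D, theta, 1))   # (0.0, 0.5, 0.5)
```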

7.3.4  Multicriterion  formulation  and  scalarization 

The optimal detector design problem can be considered a multicriterion problem,
with the constraints (7.11), and the $m(m-1)$ objectives given by the off-diagonal
entries of $D$, which are the probabilities of the different types of detection error:

$$\begin{array}{ll}
\mbox{minimize (w.r.t. $\mathbf{R}_+^{m(m-1)}$)} & D_{ij}, \quad i, j = 1, \ldots, m, \quad i \neq j \\
\mbox{subject to} & t_k \succeq 0, \quad \mathbf{1}^T t_k = 1, \quad k = 1, \ldots, n,
\end{array} \qquad (7.12)$$

with variables $t_1, \ldots, t_n \in \mathbf{R}^m$. Since each objective $D_{ij}$ is a linear function of the
variables, this is a multicriterion linear program.

We can scalarize this multicriterion problem by forming the weighted sum
objective

$$\sum_{i,j=1}^m W_{ij} D_{ij} = \mathbf{tr}(W^T D)$$

where the weight matrix $W \in \mathbf{R}^{m \times m}$ satisfies

$$W_{ii} = 0, \quad i = 1, \ldots, m, \qquad W_{ij} > 0, \quad i, j = 1, \ldots, m, \quad i \neq j.$$

This objective is a weighted sum of the $m(m-1)$ error probabilities, with weight
$W_{ij}$ associated with the error of guessing $\hat\theta = i$ when in fact $\theta = j$. The weight
matrix is sometimes called the loss matrix.

To find a Pareto optimal point for the multicriterion problem (7.12), we form
the scalar optimization problem

$$\begin{array}{ll}
\mbox{minimize}   & \mathbf{tr}(W^T D) \\
\mbox{subject to} & t_k \succeq 0, \quad \mathbf{1}^T t_k = 1, \quad k = 1, \ldots, n,
\end{array} \qquad (7.13)$$

which is an LP. This LP is separable in the variables $t_1, \ldots, t_n$. The objective can
be expressed as a sum of (linear) functions of $t_k$:

$$\mathbf{tr}(W^T D) = \mathbf{tr}(W^T T P) = \mathbf{tr}(P W^T T) = \sum_{k=1}^n c_k^T t_k,$$

where $c_k$ is the $k$th column of $W P^T$. The constraints are separable (i.e., we have
separate constraints on each $t_i$). Therefore we can solve the LP (7.13) by separately
solving

$$\begin{array}{ll}
\mbox{minimize}   & c_k^T t_k \\
\mbox{subject to} & t_k \succeq 0, \quad \mathbf{1}^T t_k = 1,
\end{array}$$

for $k = 1, \ldots, n$. Each of these LPs has a simple analytical solution (see
exercise 4.8). We first find an index $q$ such that $c_{kq} = \min_j c_{kj}$. Then we take $t_k^\star = e_q$.
This optimal point corresponds to a deterministic detector: when $X = k$ is
observed, our estimate is

$$\hat\theta = \mathop{\rm argmin}_j \; (W P^T)_{jk}. \qquad (7.14)$$

Thus, for every weight matrix $W$ with positive off-diagonal elements we can find
a deterministic detector that minimizes the weighted sum objective. This seems
to suggest that randomized detectors are not needed, but we will see this is not
the case. The Pareto optimal trade-off surface for the multicriterion LP (7.12) is
piecewise-linear; the deterministic detectors of the form (7.14) correspond to the
vertices on the Pareto optimal surface.
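
The detector (7.14) requires only a matrix product and a column-wise argmin. The sketch below computes it for a hypothetical loss matrix `W` and distribution matrix `P`, using 0-based indices and breaking ties toward the smallest index (an assumption; any minimizing index is optimal).

```python
def weighted_sum_detector(W, P):
    """Deterministic detector (7.14): guess argmin_j (W P^T)_{jk} when X = k.

    W is m x m (loss matrix), P is n x m; indices are 0-based.
    """
    m, n = len(W), len(P)
    guesses = []
    for k in range(n):
        # c_k is the kth column of W P^T: c_k[j] = sum_i W[j][i] * P[k][i]
        c_k = [sum(W[j][i] * P[k][i] for i in range(m)) for j in range(m)]
        guesses.append(min(range(m), key=lambda j: c_k[j]))
    return guesses

P = [[0.6, 0.1], [0.3, 0.3], [0.1, 0.6]]   # hypothetical, columns sum to one
W = [[0, 3], [1, 0]]   # guessing hypothesis 1 when theta = 2 costs three times as much
print(weighted_sum_detector(W, P))   # [0, 1, 1]
```

Changing the weights moves the detector from one vertex of the trade-off surface to another; with equal off-diagonal weights the same data give `[0, 0, 1]`.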

MAP  and  ML  detectors 

Consider a Bayes detector design with prior distribution $q$. The mean probability
of error is

$$q^T P^{\mathrm{e}} = \sum_{j=1}^m q_j \sum_{i \neq j} D_{ij} = \sum_{i,j=1}^m W_{ij} D_{ij},$$

if we define the weight matrix $W$ as

$$W_{ij} = q_j, \quad i, j = 1, \ldots, m, \quad i \neq j, \qquad W_{ii} = 0, \quad i = 1, \ldots, m.$$

Thus, a Bayes optimal detector is given by the deterministic detector (7.14), with

$$(W P^T)_{jk} = \sum_{i \neq j} q_i p_{ki} = \sum_{i=1}^m q_i p_{ki} - q_j p_{kj}.$$

The first term is independent of $j$, so the optimal detector is simply

$$\hat\theta = \mathop{\rm argmax}_j \; (p_{kj} q_j),$$

when $X = k$ is observed. The solution has a simple interpretation: Since $p_{kj} q_j$
gives the probability that $\theta = j$ and $X = k$, this detector is a maximum a posteriori
probability (MAP) detector.

For the special case $q = (1/m)\mathbf{1}$, i.e., a uniform prior distribution on $\theta$, this
MAP detector reduces to a maximum likelihood (ML) detector:

$$\hat\theta = \mathop{\rm argmax}_j \; p_{kj}.$$

Thus, a maximum likelihood detector minimizes the (unweighted) average or mean
probability of error.
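
A short sketch of the MAP rule follows, on hypothetical data with 0-based indices and ties broken toward the smallest index. With a uniform prior it returns the maximum likelihood guesses; a prior that favors the second hypothesis shifts one guess.

```python
def map_detector(P, q):
    """Guess argmax_j P[k][j] * q[j] for each observation k (0-based indices)."""
    return [max(range(len(q)), key=lambda j: row[j] * q[j]) for row in P]

P = [[0.6, 0.1], [0.3, 0.3], [0.1, 0.6]]   # hypothetical, columns sum to one
print(map_detector(P, [0.5, 0.5]))     # uniform prior: [0, 0, 1], the ML detector
print(map_detector(P, [0.25, 0.75]))   # prior favoring hypothesis 2: [0, 1, 1]
```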


7.3.5  Binary  hypothesis  testing 


As an illustration, we consider the special case $m = 2$, which is called binary
hypothesis testing. The random variable $X$ is generated from one of two
distributions, which we denote $p \in \mathbf{R}^n$ and $q \in \mathbf{R}^n$, to simplify the notation. Often the
hypothesis $\theta = 1$ corresponds to some normal situation, and the hypothesis $\theta = 2$
corresponds to some abnormal event that we are trying to detect. If $\hat\theta = 1$, we say
the test is negative (i.e., we guess that the event did not occur); if $\hat\theta = 2$, we say
the test is positive (i.e., we guess that the event did occur).

The detection probability matrix $D \in \mathbf{R}^{2 \times 2}$ is traditionally expressed as

$$D = \left[\begin{array}{cc}
1 - P_{\mathrm{fp}} & P_{\mathrm{fn}} \\
P_{\mathrm{fp}} & 1 - P_{\mathrm{fn}}
\end{array}\right].$$

Here $P_{\mathrm{fn}}$ is the probability of a false negative (i.e., the test is negative when in fact
the event has occurred) and $P_{\mathrm{fp}}$ is the probability of a false positive (i.e., the test
is positive when in fact the event has not occurred), which is also called the false
alarm probability. The optimal detector design problem is a bi-criterion problem,
with objectives $P_{\mathrm{fn}}$ and $P_{\mathrm{fp}}$.

The optimal trade-off curve between $P_{\mathrm{fn}}$ and $P_{\mathrm{fp}}$ is called the receiver operating
characteristic (ROC), and is determined by the distributions $p$ and $q$. The ROC
can be found by scalarizing the bi-criterion problem, as described in §7.3.4. For
the weight matrix $W$, an optimal detector (7.14) is

$$\hat\theta = \left\{ \begin{array}{ll}
1 & W_{21} p_k > W_{12} q_k \\
2 & W_{21} p_k \leq W_{12} q_k
\end{array} \right.$$

when $X = k$ is observed. This is called a likelihood ratio threshold test: if the
ratio $p_k/q_k$ is more than the threshold $W_{12}/W_{21}$, the test is negative (i.e., $\hat\theta = 1$);
otherwise the test is positive. By choosing different values of the threshold,
we obtain (deterministic) Pareto optimal detectors that give different levels of
false positive versus false negative error probabilities. This result is known as
the Neyman-Pearson lemma.

Figure 7.4 Optimal trade-off curve between probability of a false negative,
and probability of a false positive test result, for the matrix $P$ given in (7.15).
The vertices of the trade-off curve, labeled 1–3, correspond to deterministic
detectors; the point labeled 4, which is a randomized detector, is the
minimax detector. The dashed line shows $P_{\mathrm{fn}} = P_{\mathrm{fp}}$, the points where the error
probabilities are equal.

The  likelihood  ratio  detectors  do  not  give  all  the  Pareto  optimal  detectors;  they 
are  the  vertices  of  the  optimal  trade-off  curve,  which  is  piecewise-linear. 
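
A likelihood ratio threshold test and its two error probabilities can be sketched in a few lines. The distributions below are hypothetical; the threshold plays the role of $W_{12}/W_{21}$, and the comparison is written in product form so that entries with $q_k = 0$ cause no division.

```python
def likelihood_ratio_test(p, q, threshold):
    """Guess 1 (negative) where p_k / q_k > threshold, otherwise 2 (positive).

    The comparison is written as p_k > threshold * q_k so that q_k = 0 is allowed.
    """
    return [1 if pk > threshold * qk else 2 for pk, qk in zip(p, q)]

def error_rates(guesses, p, q):
    """Return (P_fp, P_fn) for a deterministic binary detector."""
    p_fp = sum(pk for g, pk in zip(guesses, p) if g == 2)   # positive, theta = 1
    p_fn = sum(qk for g, qk in zip(guesses, q) if g == 1)   # negative, theta = 2
    return p_fp, p_fn

p = [0.5, 0.25, 0.25]     # hypothetical distribution under theta = 1
q = [0.125, 0.125, 0.75]  # hypothetical distribution under theta = 2
for threshold in (0.5, 3.0):
    guesses = likelihood_ratio_test(p, q, threshold)
    print(threshold, guesses, error_rates(guesses, p, q))
# 0.5 [1, 1, 2] (0.25, 0.25)
# 3.0 [1, 2, 2] (0.5, 0.125)
```

Raising the threshold trades a larger false positive probability for a smaller false negative probability, moving along the vertices of the trade-off curve.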


Example 7.4 We consider a binary hypothesis testing example with $n = 4$, and

$$P = \left[\begin{array}{cc}
0.70 & 0.10 \\
0.20 & 0.10 \\
0.05 & 0.70 \\
0.05 & 0.10
\end{array}\right]. \qquad (7.15)$$


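The optimal trade-off curve between $P_{\mathrm{fn}}$ and $P_{\mathrm{fp}}$, i.e., the receiver operating curve,
is shown in figure 7.4. The left endpoint corresponds to the detector which is always
negative, independent of the observed value of $X$; the right endpoint corresponds to
the detector that is always positive. The vertices labeled 1, 2, and 3 correspond to
the deterministic detectors

$$T^{(1)} = \left[\begin{array}{cccc} 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{array}\right], \qquad
  T^{(2)} = \left[\begin{array}{cccc} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{array}\right], \qquad
  T^{(3)} = \left[\begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \end{array}\right],$$

respectively. The point labeled 4 corresponds to the nondeterministic detector

$$T^{(4)} = \left[\begin{array}{cccc} 1 & 2/3 & 0 & 0 \\ 0 & 1/3 & 1 & 1 \end{array}\right],$$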

which  is  the  minimax  detector.  This  minimax  detector  yields  equal  probability  of 
a  false  positive  and  false  negative,  which  in  this  case  is  1/6.  Every  deterministic 
detector  has  either  a  false  positive  or  false  negative  probability  that  exceeds  1/6, 
so  this  is  an  example  where  a  randomized  detector  outperforms  every  deterministic 
detector. 
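
The claims in this example can be checked with exact rational arithmetic. The sketch below takes the two columns of (7.15) as `p` and `q`, describes a detector by the second row of $T$ (the probability of a positive test for each observed value), and enumerates all 16 deterministic detectors; the function name is ours.

```python
from fractions import Fraction as F
from itertools import product

p = [F(70, 100), F(20, 100), F(5, 100), F(5, 100)]    # column 1 of (7.15)
q = [F(10, 100), F(10, 100), F(70, 100), F(10, 100)]  # column 2 of (7.15)

def error_rates(t2):
    """(P_fp, P_fn) when t2[k] = prob(guess 2 | X = k)."""
    p_fp = sum(t * pk for t, pk in zip(t2, p))
    p_fn = sum((1 - t) * qk for t, qk in zip(t2, q))
    return p_fp, p_fn

# Randomized detector: second row of T^(4).
print(error_rates([0, F(1, 3), 1, 1]))   # both error probabilities equal 1/6

# Smallest worst-case error over all 16 deterministic detectors.
best = min(max(error_rates(t2)) for t2 in product([0, 1], repeat=4))
print(best)   # 1/5
```

The best deterministic detector has worst-case error probability 1/5, which exceeds the value 1/6 achieved by the randomized detector.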


7.3.6  Robust  detectors 


So far we have assumed that $P$, which gives the distribution of the observed variable
$X$, for each value of the parameter $\theta$, is known. In this section we consider the case
where these distributions are not known, but certain prior information about them
is given. We assume that $P \in \mathcal{P}$, where $\mathcal{P}$ is the set of possible distributions. With
a randomized detector characterized by $T$, the detection probability matrix $D$ now
depends on the particular value of $P$. We will judge the error probabilities by
their worst-case values, over $P \in \mathcal{P}$. We define the worst-case detection probability
matrix $D^{\mathrm{wc}}$ as

$$D^{\mathrm{wc}}_{ij} = \sup_{P \in \mathcal{P}} D_{ij}, \quad i, j = 1, \ldots, m, \quad i \neq j$$

and

$$D^{\mathrm{wc}}_{ii} = \inf_{P \in \mathcal{P}} D_{ii}, \quad i = 1, \ldots, m.$$

The off-diagonal entries give the largest possible probability of errors, and the
diagonal entries give the smallest possible probability of detection, over $P \in \mathcal{P}$.
Note that $\sum_{i=1}^m D^{\mathrm{wc}}_{ij} \neq 1$ in general, i.e., the columns of a worst-case detection
probability matrix do not necessarily add up to one.

We define the worst-case probability of error as

$$P^{\mathrm{wce}}_i = 1 - D^{\mathrm{wc}}_{ii}.$$

Thus, $P^{\mathrm{wce}}_i$ is the largest probability of error, when $\theta = i$, over all possible
distributions in $\mathcal{P}$.

Using  the  worst-case  detection  probability  matrix,  or  the  worst-case  probability 
of  error  vector,  we  can  develop  various  robust  versions  of  detector  design  problems. 
In  the  rest  of  this  section  we  concentrate  on  the  robust  minimax  detector  design 
problem,  as  a  generic  example  that  illustrates  the  ideas. 

We define the robust minimax detector as the detector that minimizes the
worst-case probability of error, over all hypotheses, i.e., minimizes the objective

$$\max_i P^{\mathrm{wce}}_i = \max_i \sup_{P \in \mathcal{P}} \left( 1 - (TP)_{ii} \right) = 1 - \min_i \inf_{P \in \mathcal{P}} (TP)_{ii}.$$

The robust minimax detector minimizes the worst possible probability of error,
over all $m$ hypotheses, and over all $P \in \mathcal{P}$.

Robust minimax detector for finite $\mathcal{P}$

When the set of possible distributions is finite, the robust minimax detector design
problem is readily formulated as an LP. With $\mathcal{P} = \{P_1, \ldots, P_k\}$, we can find the
robust minimax detector by solving

$$\begin{array}{ll}
\mbox{maximize}   & \min_{i=1,\ldots,m} \inf_{P \in \mathcal{P}} (TP)_{ii} = \min_{i=1,\ldots,m} \min_{j=1,\ldots,k} (T P_j)_{ii} \\
\mbox{subject to} & t_i \succeq 0, \quad \mathbf{1}^T t_i = 1, \quad i = 1, \ldots, n.
\end{array}$$

The objective is piecewise-linear and concave, so this problem can be expressed as
an LP. Note that we can just as well consider $\mathcal{P}$ to be the polyhedron $\mathbf{conv}\, \mathcal{P}$;
the associated worst-case detection matrix, and robust minimax detector, are the
same.
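
For a given detector, the objective of this problem is simply a minimum over finitely many diagonal entries. The sketch below evaluates it for a fixed, hypothetical $T$ and two hypothetical models; it does not solve the LP, which would additionally optimize over $T$.

```python
def worst_case_detection(T, models):
    """min over hypotheses i and models P_j of (T P_j)_{ii}.

    T is m x n; each model is an n x m matrix whose columns are distributions.
    """
    m, n = len(T), len(T[0])
    return min(sum(T[i][k] * P[k][i] for k in range(n))
               for P in models for i in range(m))

# Two hypothetical models for the distributions of X (n = 3, m = 2).
P1 = [[0.5, 0.25], [0.25, 0.25], [0.25, 0.5]]
P2 = [[0.75, 0.25], [0.125, 0.125], [0.125, 0.625]]
T = [[1, 0.5, 0], [0, 0.5, 1]]
print(worst_case_detection(T, [P1, P2]))   # 0.625
```

The corresponding worst-case probability of error is one minus this value, here 0.375.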

Robust minimax detector for polyhedral $\mathcal{P}$

It is also possible to efficiently formulate the robust minimax detector problem as an
LP when $\mathcal{P}$ is a polyhedron described by linear equality and inequality constraints.
This formulation is less obvious, and relies on a dual representation of $\mathcal{P}$.

To simplify the discussion, we assume that $\mathcal{P}$ has the form

$$\mathcal{P} = \left\{ P = [p_1 \; \cdots \; p_m] \mid A_k p_k = b_k, \; \mathbf{1}^T p_k = 1, \; p_k \succeq 0 \right\}. \qquad (7.16)$$

In other words, for each distribution $p_k$, we are given some expected values $A_k p_k = b_k$.
(These might represent known moments, probabilities, etc.) The extension to
the case where we are given inequalities on expected values is straightforward.
The robust minimax design problem is

$$\begin{array}{ll}
\mbox{maximize}   & \gamma \\
\mbox{subject to} & \inf \{ \tilde t_i^T p \mid A_i p = b_i, \; \mathbf{1}^T p = 1, \; p \succeq 0 \} \geq \gamma, \quad i = 1, \ldots, m \\
                  & t_i \succeq 0, \quad \mathbf{1}^T t_i = 1, \quad i = 1, \ldots, n,
\end{array}$$

where $\tilde t_i^T$ denotes the $i$th row of $T$ (so that $(TP)_{ii} = \tilde t_i^T p_i$). By LP duality,

$$\inf \{ \tilde t_i^T p \mid A_i p = b_i, \; \mathbf{1}^T p = 1, \; p \succeq 0 \}
  = \sup \{ \nu^T b_i + \mu \mid A_i^T \nu + \mu \mathbf{1} \preceq \tilde t_i \}.$$
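
The duality step can be checked directly from the Lagrangian. The following sketch drops the index $i$ and introduces a multiplier $\lambda \succeq 0$ for the constraint $p \succeq 0$ (our notation, not the text's):

```latex
\begin{aligned}
L(p, \nu, \mu, \lambda)
  &= \tilde t^T p + \nu^T (b - A p) + \mu (1 - \mathbf{1}^T p) - \lambda^T p \\
  &= \nu^T b + \mu + (\tilde t - A^T \nu - \mu \mathbf{1} - \lambda)^T p, \\
\inf_p L(p, \nu, \mu, \lambda)
  &= \begin{cases}
       \nu^T b + \mu & \text{if } A^T \nu + \mu \mathbf{1} + \lambda = \tilde t \\
       -\infty & \text{otherwise.}
     \end{cases}
\end{aligned}
```

Eliminating $\lambda \succeq 0$ leaves the constraint $A^T \nu + \mu \mathbf{1} \preceq \tilde t$, and maximizing $\nu^T b + \mu$ over this set gives the dual problem. When the set of distributions satisfying $Ap = b$ is nonempty, strong duality for LPs gives equality of the two optimal values.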

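Using this, the robust minimax detector design problem can be expressed as the
LP

$$\begin{array}{ll}
\mbox{maximize}   & \gamma \\
\mbox{subject to} & \nu_i^T b_i + \mu_i \geq \gamma, \quad i = 1, \ldots, m \\
                  & A_i^T \nu_i + \mu_i \mathbf{1} \preceq \tilde t_i, \quad i = 1, \ldots, m \\
                  & t_i \succeq 0, \quad \mathbf{1}^T t_i = 1, \quad i = 1, \ldots, n,
\end{array}$$

with variables $\nu_1, \ldots, \nu_m$, $\mu_1, \ldots, \mu_m$, and $T$ (which has columns $t_i$ and rows $\tilde t_i^T$).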


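Example 7.5 Robust binary hypothesis testing. Suppose $m = 2$ and the set $\mathcal{P}$ in (7.16)
is defined by

$$A_1 = A_2 = A = \left[\begin{array}{cccc}
a_1 & a_2 & \cdots & a_n \\
a_1^2 & a_2^2 & \cdots & a_n^2
\end{array}\right], \qquad
b_1 = \left[\begin{array}{c} \alpha_1 \\ \alpha_2 \end{array}\right], \qquad
b_2 = \left[\begin{array}{c} \beta_1 \\ \beta_2 \end{array}\right].$$

Designing a robust minimax detector for this set $\mathcal{P}$ can be interpreted as a binary
hypothesis testing problem: based on an observation of a random variable
$X \in \{a_1, \ldots, a_n\}$, choose between the following two hypotheses:

1. $\mathbf{E} X = \alpha_1$, $\mathbf{E} X^2 = \alpha_2$

2. $\mathbf{E} X = \beta_1$, $\mathbf{E} X^2 = \beta_2$.

Let $\tilde t^T$ denote the first row of $T$ (and so, $(\mathbf{1} - \tilde t)^T$ is the second row). For given $\tilde t$, the
worst-case probabilities of correct detection are

$$D^{\mathrm{wc}}_{11} = \inf \left\{ \tilde t^T p \;\middle|\; \sum_{i=1}^n a_i p_i = \alpha_1, \; \sum_{i=1}^n a_i^2 p_i = \alpha_2, \; \mathbf{1}^T p = 1, \; p \succeq 0 \right\}$$

$$D^{\mathrm{wc}}_{22} = \inf \left\{ (\mathbf{1} - \tilde t)^T p \;\middle|\; \sum_{i=1}^n a_i p_i = \beta_1, \; \sum_{i=1}^n a_i^2 p_i = \beta_2, \; \mathbf{1}^T p = 1, \; p \succeq 0 \right\}.$$

Using LP duality we can express $D^{\mathrm{wc}}_{11}$ as the optimal value of the LP

$$\begin{array}{ll}
\mbox{maximize}   & z_0 + z_1 \alpha_1 + z_2 \alpha_2 \\
\mbox{subject to} & z_0 + a_i z_1 + a_i^2 z_2 \leq \tilde t_i, \quad i = 1, \ldots, n,
\end{array}$$

with variables $z_0, z_1, z_2 \in \mathbf{R}$. Similarly $D^{\mathrm{wc}}_{22}$ is the optimal value of the LP

$$\begin{array}{ll}
\mbox{maximize}   & w_0 + w_1 \beta_1 + w_2 \beta_2 \\
\mbox{subject to} & w_0 + a_i w_1 + a_i^2 w_2 \leq 1 - \tilde t_i, \quad i = 1, \ldots, n,
\end{array}$$

with variables $w_0, w_1, w_2 \in \mathbf{R}$. To obtain the minimax detector, we have to maximize
the minimum of $D^{\mathrm{wc}}_{11}$ and $D^{\mathrm{wc}}_{22}$, i.e., solve the LP

$$\begin{array}{ll}
\mbox{maximize}   & \gamma \\
\mbox{subject to} & z_0 + z_1 \alpha_1 + z_2 \alpha_2 \geq \gamma \\
                  & w_0 + \beta_1 w_1 + \beta_2 w_2 \geq \gamma \\
                  & z_0 + z_1 a_i + z_2 a_i^2 \leq \tilde t_i, \quad i = 1, \ldots, n \\
                  & w_0 + w_1 a_i + w_2 a_i^2 \leq 1 - \tilde t_i, \quad i = 1, \ldots, n \\
                  & 0 \preceq \tilde t \preceq \mathbf{1}.
\end{array}$$

The variables are $z_0$, $z_1$, $z_2$, $w_0$, $w_1$, $w_2$ and $\tilde t$. The last constraint expresses the
requirement (7.11) that each column of $T$ is a probability distribution.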

7.4 Chebyshev and Chernoff bounds

In this section we consider two types of classical bounds on the probability of a set,
and show that generalizations of each can be cast as convex optimization problems.
The original classical bounds correspond to simple convex optimization problems
with analytical solutions; the convex optimization formulation of the general cases
allows us to compute better bounds, or bounds for more complex situations.


7.4.1  Chebyshev  bounds 

Chebyshev bounds give an upper bound on the probability of a set based on known
expected values of certain functions (e.g., mean and variance). The simplest
example is Markov's inequality: If $X$ is a random variable on $\mathbf{R}_+$ with $\mathbf{E} X = \mu$,
then we have $\mathbf{prob}(X \geq 1) \leq \mu$, no matter what the distribution of $X$ is.
Another simple example is Chebyshev's bound: If $X$ is a random variable on $\mathbf{R}$ with
$\mathbf{E} X = \mu$ and $\mathbf{E}(X - \mu)^2 = \sigma^2$, then we have $\mathbf{prob}(|X - \mu| \geq 1) \leq \sigma^2$, again no
matter what the distribution of $X$ is. The idea behind these simple bounds can be
generalized to a setting in which convex optimization is used to compute a bound
on the probability.
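
Both classical bounds are easy to verify on a concrete distribution. The sketch below uses a hypothetical three-point distribution on $\mathbf{R}_+$ and compares each tail probability with its bound.

```python
values = [0.0, 0.5, 2.0]     # hypothetical support, all nonnegative
probs = [0.5, 0.25, 0.25]    # hypothetical probabilities

mu = sum(v * p for v, p in zip(values, probs))
var = sum((v - mu) ** 2 * p for v, p in zip(values, probs))

tail = sum(p for v, p in zip(values, probs) if v >= 1)
dev = sum(p for v, p in zip(values, probs) if abs(v - mu) >= 1)

print(tail, "<=", mu)    # Markov: 0.25 <= 0.625
print(dev, "<=", var)    # Chebyshev: 0.25 <= 0.671875
```

Neither bound is tight for this particular distribution; the bounds hold for every distribution with the given moments, so some slack is expected.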

Let $X$ be a random variable on $S \subseteq \mathbf{R}^m$, and $C \subseteq S$ be the set for which we
want to bound $\mathbf{prob}(X \in C)$. Let $1_C$ denote the 0-1 indicator function of the set
$C$, i.e., $1_C(z) = 1$ if $z \in C$ and $1_C(z) = 0$ if $z \notin C$.

Our prior knowledge of the distribution consists of known expected values of
some functions:

$$\mathbf{E} f_i(X) = a_i, \quad i = 1, \ldots, n,$$

where $f_i : \mathbf{R}^m \to \mathbf{R}$. We take $f_0$ to be the constant function with value one, for
which we always have $\mathbf{E} f_0(X) = a_0 = 1$. Consider a linear combination of the
functions $f_i$, given by

$$f(z) = \sum_{i=0}^n x_i f_i(z),$$

where $x_i \in \mathbf{R}$, $i = 0, \ldots, n$. From our knowledge of $\mathbf{E} f_i(X)$, we have $\mathbf{E} f(X) = a^T x$.

Now suppose that $f$ satisfies the condition $f(z) \geq 1_C(z)$ for all $z \in S$, i.e., $f$
is pointwise greater than or equal to the indicator function of $C$ (on $S$). Then we
have

$$\mathbf{E} f(X) = a^T x \geq \mathbf{E}\, 1_C(X) = \mathbf{prob}(X \in C).$$

In other words, $a^T x$ is an upper bound on $\mathbf{prob}(X \in C)$, valid for all distributions
supported on $S$, with $\mathbf{E} f_i(X) = a_i$.

We can search for the best such upper bound on $\mathbf{prob}(X \in C)$, by solving the
problem

$$\begin{array}{ll}
\mbox{minimize}   & x_0 + a_1 x_1 + \cdots + a_n x_n \\
\mbox{subject to} & f(z) = \sum_{i=0}^n x_i f_i(z) \geq 1 \mbox{ for } z \in C \\
                  & f(z) = \sum_{i=0}^n x_i f_i(z) \geq 0 \mbox{ for } z \in S, \; z \notin C,
\end{array} \qquad (7.17)$$

with variable $x \in \mathbf{R}^{n+1}$. This problem is always convex, since the constraints can
be expressed as

$$g_1(x) = 1 - \inf_{z \in C} f(z) \leq 0, \qquad g_2(x) = -\inf_{z \in S \setminus C} f(z) \leq 0$$

($g_1$ and $g_2$ are convex). The problem (7.17) can also be thought of as a semi-infinite
linear program, i.e., an optimization problem with a linear objective and an infinite
number of linear inequalities, one for each $z \in S$.

In simple cases we can solve the problem (7.17) analytically. As an example, we
take $S = \mathbf{R}_+$, $C = [1, \infty)$, $f_0(z) = 1$, and $f_1(z) = z$, with
$\mathbf{E}\, f_1(X) = \mathbf{E}\, X = \mu \leq 1$ as our prior information. The constraint $f(z) \geq 0$
for $z \in S$ reduces to $x_0 \geq 0$, $x_1 \geq 0$. The constraint $f(z) \geq 1$ for $z \in C$, i.e.,
$x_0 + x_1 z \geq 1$ for all $z \geq 1$, reduces to $x_0 + x_1 \geq 1$. The problem (7.17) is then

\[
\begin{array}{ll}
\mbox{minimize} & x_0 + \mu x_1 \\
\mbox{subject to} & x_0 \geq 0, \quad x_1 \geq 0 \\
 & x_0 + x_1 \geq 1.
\end{array}
\]

Since $0 \leq \mu \leq 1$, the optimal point for this simple LP is $x_0 = 0$, $x_1 = 1$. This gives
the classical Markov bound $\mathbf{prob}(X \geq 1) \leq \mu$.
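The small LP above can be checked numerically. The sketch below is a minimal illustration, not a general LP solver: it relies on the fact that the feasible set has only two vertices, $(1,0)$ and $(0,1)$, and that the cost vector $(1,\mu)$ is nonnegative, so the minimum is attained at one of them. The function name `markov_lp` and the value of `mu` are assumptions made for the example.

```python
def markov_lp(mu):
    """Solve: minimize x0 + mu*x1  s.t.  x0 >= 0, x1 >= 0, x0 + x1 >= 1.

    The feasible set is an unbounded polyhedron with vertices (1, 0) and (0, 1).
    Because the cost vector (1, mu) is nonnegative, the minimum is attained
    at one of these two vertices, so we simply compare them.
    """
    vertices = [(1.0, 0.0), (0.0, 1.0)]
    return min(vertices, key=lambda x: x[0] + mu * x[1])


mu = 0.3                       # assumed mean, 0 <= mu <= 1
x0, x1 = markov_lp(mu)
bound = x0 + mu * x1           # upper bound on prob(X >= 1)


def f(z):
    """The optimal function f(z) = x0 + x1*z."""
    return x0 + x1 * z


# f dominates the indicator of C = [1, inf) on a few sample points of S = R_+.
dominates = all(f(z) >= (1.0 if z >= 1.0 else 0.0) for z in [0.0, 0.5, 1.0, 3.0])
print((x0, x1), bound, dominates)
```

With $\mu = 0.3$ the optimal point is $(0, 1)$ and the bound equals $\mu$, in agreement with the analytical solution.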

In  other  cases  we  can  solve  the  problem  (7.17)  using  convex  optimization. 


Remark 7.1 Duality and the Chebyshev bound problem. The Chebyshev bound problem
(7.17) determines a bound on $\mathbf{prob}(X \in C)$ for all probability measures that
satisfy the given expected value constraints. Thus we can think of the Chebyshev
bound problem (7.17) as producing a bound on the optimal value of the
infinite-dimensional problem

\[
\begin{array}{ll}
\mbox{maximize} & \int_C \pi(dz) \\
\mbox{subject to} & \int_S f_i(z)\, \pi(dz) = a_i, \quad i = 1, \ldots, n \\
 & \int_S \pi(dz) = 1 \\
 & \pi \geq 0,
\end{array}
\tag{7.18}
\]

where the variable is the measure $\pi$, and $\pi \geq 0$ means that the measure is nonnegative.

Since the Chebyshev problem (7.17) produces a bound on the problem (7.18), it
should not be a surprise that they are related by duality. While semi-infinite and
infinite-dimensional problems are beyond the scope of this book, we can still formally
construct a dual of the problem (7.17), introducing a Lagrange multiplier function
$p : S \to \mathbf{R}$, with $p(z)$ the Lagrange multiplier associated with the inequality $f(z) \geq 1$
(for $z \in C$) or $f(z) \geq 0$ (for $z \in S \setminus C$). Using an integral over $z$ where we would have
a sum in the finite-dimensional case, we arrive at the formal dual

\[
\begin{array}{ll}
\mbox{maximize} & \int_C p(z)\, dz \\
\mbox{subject to} & \int_S f_i(z) p(z)\, dz = a_i, \quad i = 1, \ldots, n \\
 & \int_S p(z)\, dz = 1 \\
 & p(z) \geq 0 \mbox{ for all } z \in S,
\end{array}
\]

where the optimization variable is the function $p$. This is, essentially, the same
as (7.18).


Probability  bounds  with  known  first  and  second  moments 

As an example, suppose that $S = \mathbf{R}^m$, and that we are given the first and second
moments of the random variable $X$:

\[
\mathbf{E}\, X = a \in \mathbf{R}^m, \qquad \mathbf{E}\, XX^T = \Sigma \in \mathbf{S}^m.
\]

In other words, we are given the expected value of the $m$ functions $z_i$, $i = 1, \ldots, m$,
and the $m(m+1)/2$ functions $z_i z_j$, $i, j = 1, \ldots, m$, but no other information about
the distribution.

In this case we can express $f$ as the general quadratic function

\[
f(z) = z^T P z + 2 q^T z + r,
\]

where the variables (i.e., the vector $x$ in the discussion above) are $P \in \mathbf{S}^m$, $q \in \mathbf{R}^m$,
and $r \in \mathbf{R}$. From our knowledge of the first and second moments, we find that

\[
\begin{aligned}
\mathbf{E}\, f(X) &= \mathbf{E}(X^T P X + 2 q^T X + r) \\
&= \mathbf{E}\, \mathbf{tr}(P X X^T) + 2\, \mathbf{E}\, q^T X + r \\
&= \mathbf{tr}(\Sigma P) + 2 q^T a + r.
\end{aligned}
\]

The constraint that $f(z) \geq 0$ for all $z$ can be expressed as the linear matrix
inequality

\[
\begin{bmatrix} P & q \\ q^T & r \end{bmatrix} \succeq 0.
\]

In particular, we have $P \succeq 0$.
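The identity $\mathbf{E}\, f(X) = \mathbf{tr}(\Sigma P) + 2q^Ta + r$ is what makes the objective linear in $(P, q, r)$. A minimal NumPy sketch, using a randomly generated discrete distribution and an arbitrary quadratic (all data here are made up for the check), confirms it numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 3, 5

# Hypothetical discrete distribution: N support points (rows) with probabilities w.
pts = rng.normal(size=(N, m))
w = rng.random(N)
w /= w.sum()

a = w @ pts                                # first moment  E X
Sigma = (pts * w[:, None]).T @ pts         # second moment E X X^T

# Arbitrary quadratic f(z) = z^T P z + 2 q^T z + r with symmetric P.
B = rng.normal(size=(m, m))
P = B + B.T
q = rng.normal(size=m)
r = 0.7

# Left side: expectation of f computed directly from the distribution.
f_vals = np.einsum("ij,jk,ik->i", pts, P, pts) + 2 * pts @ q + r
lhs = w @ f_vals

# Right side: the expression in terms of the moments only.
rhs = np.trace(Sigma @ P) + 2 * q @ a + r

print(lhs, rhs)
```

The two numbers agree to rounding error, whatever distribution is used, as long as it has the moments $a$ and $\Sigma$.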

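Now suppose that the set $C$ is the complement of an open polyhedron,

\[
C = \mathbf{R}^m \setminus \mathcal{P}, \qquad \mathcal{P} = \{ z \mid a_i^T z < b_i,\ i = 1, \ldots, k \}.
\]

The condition that $f(z) \geq 1$ for all $z \in C$ is the same as requiring that

\[
a_i^T z \geq b_i \;\Longrightarrow\; z^T P z + 2 q^T z + r \geq 1
\]

for $i = 1, \ldots, k$. This, in turn, can be expressed as: there exist $\tau_1, \ldots, \tau_k \geq 0$ such
that

\[
\begin{bmatrix} P & q \\ q^T & r - 1 \end{bmatrix}
\succeq \tau_i \begin{bmatrix} 0 & a_i/2 \\ a_i^T/2 & -b_i \end{bmatrix},
\quad i = 1, \ldots, k.
\]

(See §B.2.)

Putting it all together, the Chebyshev bound problem (7.17) can be expressed
as

\[
\begin{array}{ll}
\mbox{minimize} & \mathbf{tr}(\Sigma P) + 2 q^T a + r \\
\mbox{subject to} &
\begin{bmatrix} P & q \\ q^T & r - 1 \end{bmatrix}
\succeq \tau_i \begin{bmatrix} 0 & a_i/2 \\ a_i^T/2 & -b_i \end{bmatrix},
\quad i = 1, \ldots, k \\
 & \tau_i \geq 0, \quad i = 1, \ldots, k \\
 & \begin{bmatrix} P & q \\ q^T & r \end{bmatrix} \succeq 0,
\end{array}
\tag{7.19}
\]

which is a semidefinite program in the variables $P$, $q$, $r$, and $\tau_1, \ldots, \tau_k$. The
optimal value, say $\alpha$, is an upper bound on $\mathbf{prob}(X \in C)$ over all distributions
with mean $a$ and second moment $\Sigma$. Or, turning it around, $1 - \alpha$ is a lower bound
on $\mathbf{prob}(X \in \mathcal{P})$.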


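Remark 7.2 Duality and the Chebyshev bound problem. The dual SDP associated
with (7.19) can be expressed as

\[
\begin{array}{ll}
\mbox{maximize} & \sum_{i=1}^k \lambda_i \\
\mbox{subject to} & a_i^T z_i \geq b_i \lambda_i, \quad i = 1, \ldots, k \\
 & \sum_{i=1}^k \begin{bmatrix} Z_i & z_i \\ z_i^T & \lambda_i \end{bmatrix}
   \preceq \begin{bmatrix} \Sigma & a \\ a^T & 1 \end{bmatrix} \\
 & \begin{bmatrix} Z_i & z_i \\ z_i^T & \lambda_i \end{bmatrix} \succeq 0, \quad i = 1, \ldots, k.
\end{array}
\]

The variables are $Z_i \in \mathbf{S}^m$, $z_i \in \mathbf{R}^m$, and $\lambda_i \in \mathbf{R}$, for $i = 1, \ldots, k$. Since the
SDP (7.19) is strictly feasible, strong duality holds and the dual optimum is attained.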

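We can give an interesting probability interpretation to the dual problem. Suppose
$Z_i$, $z_i$, $\lambda_i$ are dual feasible and that the first $r$ components of $\lambda$ are positive, and the
rest are zero. For simplicity we also assume that $\sum_{i=1}^k \lambda_i < 1$. We define

\[
\begin{aligned}
x_i &= (1/\lambda_i) z_i, \quad i = 1, \ldots, r, \\
w_0 &= \frac{1}{\mu} \left( a - \sum_{i=1}^r \lambda_i x_i \right), \\
W &= \frac{1}{\mu} \left( \Sigma - \sum_{i=1}^r \lambda_i x_i x_i^T \right),
\end{aligned}
\]

where $\mu = 1 - \sum_{i=1}^k \lambda_i$. With these definitions the dual feasibility constraints can be
expressed as

\[
a_i^T x_i \geq b_i, \quad i = 1, \ldots, r
\]

and

\[
\sum_{i=1}^r \lambda_i \begin{bmatrix} x_i x_i^T & x_i \\ x_i^T & 1 \end{bmatrix}
+ \mu \begin{bmatrix} W & w_0 \\ w_0^T & 1 \end{bmatrix}
= \begin{bmatrix} \Sigma & a \\ a^T & 1 \end{bmatrix}.
\]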

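Moreover, from dual feasibility,

\[
\begin{aligned}
\mu \begin{bmatrix} W & w_0 \\ w_0^T & 1 \end{bmatrix}
&= \begin{bmatrix} \Sigma & a \\ a^T & 1 \end{bmatrix}
 - \sum_{i=1}^r \lambda_i \begin{bmatrix} x_i x_i^T & x_i \\ x_i^T & 1 \end{bmatrix} \\
&= \begin{bmatrix} \Sigma & a \\ a^T & 1 \end{bmatrix}
 - \sum_{i=1}^r \begin{bmatrix} (1/\lambda_i) z_i z_i^T & z_i \\ z_i^T & \lambda_i \end{bmatrix} \\
&\succeq \begin{bmatrix} \Sigma & a \\ a^T & 1 \end{bmatrix}
 - \sum_{i=1}^r \begin{bmatrix} Z_i & z_i \\ z_i^T & \lambda_i \end{bmatrix} \\
&\succeq 0.
\end{aligned}
\]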


Therefore, $W \succeq w_0 w_0^T$, so it can be factored as $W - w_0 w_0^T = \sum_{i=1}^s w_i w_i^T$. Now
consider a discrete random variable $X$ with the following distribution. If $s \geq 1$, we
take

\[
\begin{array}{ll}
X = x_i & \mbox{with probability } \lambda_i, \quad i = 1, \ldots, r \\
X = w_0 + \sqrt{s}\, w_i & \mbox{with probability } \mu/(2s), \quad i = 1, \ldots, s \\
X = w_0 - \sqrt{s}\, w_i & \mbox{with probability } \mu/(2s), \quad i = 1, \ldots, s.
\end{array}
\]

If $s = 0$, we take

\[
\begin{array}{ll}
X = x_i & \mbox{with probability } \lambda_i, \quad i = 1, \ldots, r \\
X = w_0 & \mbox{with probability } \mu.
\end{array}
\]

It is easily verified that $\mathbf{E}\, X = a$ and $\mathbf{E}\, XX^T = \Sigma$, i.e., the distribution matches the
given moments. Furthermore, since $x_i \in C$,

\[
\mathbf{prob}(X \in C) \geq \sum_{i=1}^r \lambda_i.
\]

In particular, by applying this interpretation to the dual optimal solution, we can
construct a distribution that satisfies the Chebyshev bound from (7.19) with equality,
which shows that the Chebyshev bound is sharp for this case.
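The construction in Remark 7.2 can be sketched in a few lines of NumPy. The data below ($a$, $\Sigma$, one point $x_1$ with weight $\lambda_1$) are hypothetical and are not obtained from a solved dual SDP; they are simply chosen so that $W \succeq w_0 w_0^T$ holds. The factorization uses an eigendecomposition, which is one of several possible choices.

```python
import numpy as np

# Hypothetical data (not taken from a solved dual SDP): m = 2, r = 1.
a = np.zeros(2)
Sigma = np.eye(2)
x = [np.array([2.0, 0.0])]     # points x_i, assumed to lie in C
lam = [0.1]                    # their probabilities lambda_i

mu = 1.0 - sum(lam)
w0 = (a - sum(l * xi for l, xi in zip(lam, x))) / mu
W = (Sigma - sum(l * np.outer(xi, xi) for l, xi in zip(lam, x))) / mu

# Factor W - w0 w0^T = sum_i w_i w_i^T via an eigendecomposition.
evals, evecs = np.linalg.eigh(W - np.outer(w0, w0))
assert evals.min() > -1e-12    # the construction requires W >= w0 w0^T
ws = [np.sqrt(e) * evecs[:, j] for j, e in enumerate(evals) if e > 1e-12]
s = len(ws)

# Assemble the discrete distribution described in the remark.
points, probs = list(x), list(lam)
if s >= 1:
    for wi in ws:
        points += [w0 + np.sqrt(s) * wi, w0 - np.sqrt(s) * wi]
        probs += [mu / (2 * s)] * 2
else:
    points.append(w0)
    probs.append(mu)

# Moments of the constructed distribution.
mean = sum(p * z for p, z in zip(probs, points))
second = sum(p * np.outer(z, z) for p, z in zip(probs, points))
print(mean, second, sum(probs))
```

The computed mean and second moment reproduce $a$ and $\Sigma$, and the probabilities sum to one, which is the "easily verified" step in the remark.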

7.4.2 Chernoff bounds

Let $X$ be a random variable on $\mathbf{R}$. The Chernoff bound states that

\[
\mathbf{prob}(X \geq u) \leq \inf_{\lambda \geq 0} \mathbf{E}\, e^{\lambda (X - u)},
\]

which can be expressed as

\[
\log \mathbf{prob}(X \geq u) \leq \inf_{\lambda \geq 0} \{ -\lambda u + \log \mathbf{E}\, e^{\lambda X} \}.
\tag{7.20}
\]

Recall (from example 3.41, page 106) that the righthand term, $\log \mathbf{E}\, e^{\lambda X}$, is called
the cumulant generating function of the distribution, and is always convex, so the
function to be minimized is convex. The bound (7.20) is most useful in cases when
the cumulant generating function has an analytical expression, and the minimization
over $\lambda$ can be carried out analytically.

For example, if $X$ is Gaussian with zero mean and unit variance, the cumulant
generating function is

\[
\log \mathbf{E}\, e^{\lambda X} = \lambda^2/2,
\]

and the infimum over $\lambda \geq 0$ of $-\lambda u + \lambda^2/2$ occurs with $\lambda = u$ (if $u \geq 0$), so the
Chernoff bound is (for $u \geq 0$)

\[
\mathbf{prob}(X \geq u) \leq e^{-u^2/2}.
\]
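To make the Gaussian case concrete, the sketch below minimizes $-\lambda u + \lambda^2/2$ over a grid of $\lambda \geq 0$ and compares the result with the closed form $e^{-u^2/2}$ and with the exact tail probability. The grid search, the grid size, and the value $u = 2$ are choices made for this illustration only.

```python
import math


def chernoff_gaussian(u, grid=2001, lam_max=10.0):
    """Minimize -lam*u + lam^2/2 over a uniform grid of lam in [0, lam_max]."""
    best = min(
        -lam * u + lam * lam / 2.0
        for lam in (lam_max * k / (grid - 1) for k in range(grid))
    )
    return math.exp(best)


def gaussian_tail(u):
    """Exact prob(X >= u) for X ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))


u = 2.0
numeric = chernoff_gaussian(u)          # grid minimization of the bound
closed_form = math.exp(-u * u / 2.0)    # analytical Chernoff bound
exact = gaussian_tail(u)                # true tail probability

print(numeric, closed_form, exact)
```

The grid minimum matches the analytical bound, and the exact tail probability lies below both, as it must.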

The idea behind the Chernoff bound can be extended to a more general setting,
in which convex optimization is used to compute a bound on the probability of a
set in $\mathbf{R}^m$. Let $C \subseteq \mathbf{R}^m$, and as in the description of Chebyshev bounds above,
let $1_C$ denote the 0-1 indicator function of $C$. We will derive an upper bound on
$\mathbf{prob}(X \in C)$. (In principle we can compute $\mathbf{prob}(X \in C)$, for example by Monte
Carlo simulation, or numerical integration, but either of these can be a daunting
computational task, and neither method produces guaranteed bounds.)

Let $\lambda \in \mathbf{R}^m$ and $\mu \in \mathbf{R}$, and consider the function $f : \mathbf{R}^m \to \mathbf{R}$ given by

\[
f(z) = e^{\lambda^T z + \mu}.
\]

As in the development of Chebyshev bounds, if $f$ satisfies $f(z) \geq 1_C(z)$ for all $z$,
then we can conclude that

\[
\mathbf{prob}(X \in C) = \mathbf{E}\, 1_C(X) \leq \mathbf{E}\, f(X).
\]

Clearly we have $f(z) \geq 0$ for all $z$; to have $f(z) \geq 1$ for $z \in C$ is the same as
$\lambda^T z + \mu \geq 0$ for all $z \in C$, i.e., $-\lambda^T z \leq \mu$ for all $z \in C$. Thus, if $-\lambda^T z \leq \mu$ for all
$z \in C$, we have the bound

\[
\mathbf{prob}(X \in C) \leq \mathbf{E} \exp(\lambda^T X + \mu),
\]

or, taking logarithms,

\[
\log \mathbf{prob}(X \in C) \leq \mu + \log \mathbf{E} \exp(\lambda^T X).
\]

From this we obtain a general form of Chernoff’s bound:

\[
\begin{aligned}
\log \mathbf{prob}(X \in C)
&\leq \inf \{ \mu + \log \mathbf{E} \exp(\lambda^T X) \mid -\lambda^T z \leq \mu \mbox{ for all } z \in C \} \\
&= \inf_{\lambda} \left( \sup_{z \in C} (-\lambda^T z) + \log \mathbf{E} \exp(\lambda^T X) \right) \\
&= \inf_{\lambda} \left( S_C(-\lambda) + \log \mathbf{E} \exp(\lambda^T X) \right),
\end{aligned}
\]

where $S_C$ is the support function of $C$. Note that the second term, $\log \mathbf{E} \exp(\lambda^T X)$,
is the cumulant generating function of the distribution, and is always convex (see
example 3.41, page 106). Evaluating this bound is, in general, a convex optimization
problem.

Chernoff  bound  for  a  Gaussian  variable  on  a  polyhedron 

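As a specific example, suppose that $X$ is a Gaussian random vector on $\mathbf{R}^m$ with
zero mean and covariance $I$, so its cumulant generating function is

\[
\log \mathbf{E} \exp(\lambda^T X) = \lambda^T \lambda / 2.
\]

We take $C$ to be a polyhedron described by inequalities:

\[
C = \{ x \mid Ax \preceq b \},
\]

which we assume is nonempty.

For use in the Chernoff bound, we use a dual characterization of the support
function $S_C$:

\[
\begin{aligned}
S_C(y) &= \sup \{ y^T x \mid Ax \preceq b \} \\
&= -\inf \{ -y^T x \mid Ax \preceq b \} \\
&= -\sup \{ -b^T u \mid A^T u = y,\ u \succeq 0 \} \\
&= \inf \{ b^T u \mid A^T u = y,\ u \succeq 0 \}
\end{aligned}
\]

where in the third line we use LP duality:

\[
\inf \{ c^T x \mid Ax \preceq b \} = \sup \{ -b^T u \mid A^T u + c = 0,\ u \succeq 0 \}
\]

with $c = -y$. Using this expression for $S_C$ in the Chernoff bound we obtain

\[
\begin{aligned}
\log \mathbf{prob}(X \in C)
&\leq \inf_{\lambda} \left( S_C(-\lambda) + \log \mathbf{E} \exp(\lambda^T X) \right) \\
&= \inf_{\lambda} \inf_{u} \{ b^T u + \lambda^T \lambda / 2 \mid u \succeq 0,\ A^T u + \lambda = 0 \}.
\end{aligned}
\]

Thus, the Chernoff bound on $\mathbf{prob}(X \in C)$ is the exponential of the optimal value
of the QP

\[
\begin{array}{ll}
\mbox{minimize} & b^T u + \lambda^T \lambda / 2 \\
\mbox{subject to} & u \succeq 0, \quad A^T u + \lambda = 0,
\end{array}
\tag{7.21}
\]

where the variables are $u$ and $\lambda$.

This problem has an interesting geometric interpretation. It is equivalent to

\[
\begin{array}{ll}
\mbox{minimize} & b^T u + (1/2) \| A^T u \|_2^2 \\
\mbox{subject to} & u \succeq 0,
\end{array}
\]

which is the dual of

\[
\begin{array}{ll}
\mbox{maximize} & -(1/2) \| x \|_2^2 \\
\mbox{subject to} & Ax \preceq b.
\end{array}
\]

In other words, the Chernoff bound is

\[
\mathbf{prob}(X \in C) \leq \exp(-\mathbf{dist}(0, C)^2 / 2),
\tag{7.22}
\]

where $\mathbf{dist}(0, C)$ is the Euclidean distance of the origin to $C$.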
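The equivalence between the QP and the distance formula can be checked numerically. The sketch below solves the reduced problem (minimize $b^Tu + (1/2)\|A^Tu\|_2^2$ over $u \succeq 0$) with a plain projected gradient iteration; this solver is a simple stand-in chosen for the illustration, not a method prescribed by the text, and the polyhedron is hypothetical. The closest point of $C$ to the origin is recovered as $x = -A^Tu$, which follows from maximizing the Lagrangian of the primal problem over $x$.

```python
import numpy as np


def chernoff_polyhedron(A, b, iters=2000):
    """Projected gradient on: minimize b^T u + 0.5*||A^T u||^2 subject to u >= 0.

    Returns the optimal value (the log of the Chernoff bound) and the point
    x = -A^T u, which is the point of C = {x | Ax <= b} closest to the origin.
    Assumes C is nonempty, so the problem is bounded below.
    """
    G = A @ A.T
    step = 1.0 / np.linalg.eigvalsh(G).max()       # 1/L for the quadratic
    u = np.zeros(len(b))
    for _ in range(iters):
        u = np.maximum(0.0, u - step * (b + G @ u))  # gradient step, then project
    value = b @ u + 0.5 * np.sum((A.T @ u) ** 2)
    return value, -A.T @ u


# Hypothetical polyhedron C = {x | x1 + x2 >= 2, x1 - 2*x2 <= 0}.
A = np.array([[-1.0, -1.0], [1.0, -2.0]])
b = np.array([-2.0, 0.0])

value, x_closest = chernoff_polyhedron(A, b)
bound = np.exp(value)
dist_sq = x_closest @ x_closest

print(value, x_closest, bound)
```

For this polyhedron the closest point is $(1, 1)$, so $\mathbf{dist}(0, C)^2 = 2$ and the optimal value is $-1$, giving the bound $e^{-1}$ in agreement with (7.22).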


Remark 7.3 The bound (7.22) can also be derived without using Chernoff’s inequality.
If the distance between 0 and $C$ is $d$, then there is a halfspace $\mathcal{H} = \{ z \mid a^T z \geq d \}$,
with $\|a\|_2 = 1$, that contains $C$. The random variable $a^T X$ is $\mathcal{N}(0, 1)$, so

\[
\mathbf{prob}(X \in C) \leq \mathbf{prob}(X \in \mathcal{H}) = \Phi(-d),
\]

where $\Phi$ is the cumulative distribution function of a zero mean, unit variance Gaussian.
Since $\Phi(-d) \leq e^{-d^2/2}$ for $d \geq 0$, this bound is at least as sharp as the Chernoff
bound (7.22).


7.4.3  Example 

In this section we illustrate the Chebyshev and Chernoff probability bounding
methods with a detection example. We have a set of $m$ possible symbols or signals
$s \in \{ s_1, s_2, \ldots, s_m \} \subseteq \mathbf{R}^n$, which is called the signal constellation. One of these
signals is transmitted over a noisy channel. The received signal is $x = s + v$,
where $v$ is a noise, modeled as a random variable. We assume that $\mathbf{E}\, v = 0$ and
$\mathbf{E}\, vv^T = \sigma^2 I$, i.e., the noise components $v_1, \ldots, v_n$ are zero mean, uncorrelated,
and have variance $\sigma^2$. The receiver must estimate which signal was sent on the
basis of the received signal $x = s + v$. The minimum distance detector chooses as
estimate the symbol $s_k$ closest (in Euclidean norm) to $x$. (If the noise $v$ is Gaussian,
then minimum distance decoding is the same as maximum likelihood decoding.)

If the signal $s_k$ is transmitted, correct detection occurs if $s_k$ is the estimate,
given $x$. This occurs when the signal $s_k$ is closer to $x$ than the other signals, i.e.,

\[
\| x - s_k \|_2 < \| x - s_j \|_2, \quad j \neq k.
\]

Thus, correct detection of symbol $s_k$ occurs if the random variable $v$ satisfies the
linear inequalities

\[
2 (s_j - s_k)^T (s_k + v) < \| s_j \|_2^2 - \| s_k \|_2^2, \quad j \neq k.
\]

These inequalities define the Voronoi region $V_k$ of $s_k$ in the signal constellation,
i.e., the set of points closer to $s_k$ than any other signal in the constellation. The
probability of correct detection of $s_k$ is $\mathbf{prob}(s_k + v \in V_k)$.
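The equivalence between the nearest-signal rule and the linear inequalities defining $V_k$ is easy to test numerically. In the sketch below the constellation `S`, the noise level `sigma`, and the helper names `detect` and `in_voronoi` are all hypothetical; the constellation is not the one in figure 7.5.

```python
import numpy as np

# Hypothetical constellation of m = 4 signals in R^2 (rows are the s_j).
S = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [2.0, -0.5]])


def detect(x):
    """Minimum distance detector: index of the signal closest to x."""
    return int(np.argmin(np.linalg.norm(S - x, axis=1)))


def in_voronoi(k, x):
    """Check 2 (s_j - s_k)^T x < ||s_j||^2 - ||s_k||^2 for all j != k."""
    sk = S[k]
    return all(
        2 * (sj - sk) @ x < sj @ sj - sk @ sk
        for j, sj in enumerate(S)
        if j != k
    )


rng = np.random.default_rng(1)
k, sigma = 0, 0.8

# Received signals x = s_k + v with Gaussian noise; compare the two tests.
samples = [S[k] + sigma * rng.normal(size=2) for _ in range(1000)]
agree = all((detect(x) == k) == in_voronoi(k, x) for x in samples)
print(agree)
```

The two tests agree on every sample (ties occur with probability zero), confirming that the Voronoi region is the polyhedron described by these inequalities.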

Figure 7.5 shows a simple example with $m = 7$ signals, with dimension $n = 2$.

Figure 7.5 A constellation of 7 signals $s_1, \ldots, s_7 \in \mathbf{R}^2$, shown as small circles.
The line segments show the boundaries of the corresponding Voronoi regions.
The minimum distance detector selects symbol $s_k$ when the received signal
lies closer to $s_k$ than to any of the other points, i.e., if the received signal is
in the interior of the Voronoi region around symbol $s_k$. The circles around
each point have radius one, to show the scale.


Chebyshev  bounds 

The SDP bound (7.19) provides a lower bound on the probability of correct detection,
and is plotted in figure 7.6, as a function of the noise standard deviation $\sigma$,
for the three symbols $s_1$, $s_2$, and $s_3$. These bounds hold for any noise distribution
with zero mean and covariance $\sigma^2 I$. They are tight in the sense that there exists
a noise distribution with zero mean and covariance $\Sigma = \sigma^2 I$, for which the
probability of correct detection is equal to the lower bound. This is illustrated in
figure 7.7, for the first Voronoi set, and $\sigma = 1$.

Chernoff  bounds 

We use the same example to illustrate the Chernoff bound. Here we assume that the
noise is Gaussian, i.e., $v \sim \mathcal{N}(0, \sigma^2 I)$. If symbol $s_k$ is transmitted, the probability
of correct detection is the probability that $s_k + v \in V_k$. To find a lower bound for
this probability, we use the QP (7.21) to compute upper bounds on the probability
that the ML detector selects symbol $i$, $i = 1, \ldots, m$, $i \neq k$. (Each of these upper
bounds is related to the distance of $s_k$ to the Voronoi set $V_i$.) Adding these upper
bounds on the probabilities of mistaking $s_k$ for $s_i$, we obtain an upper bound on
the probability of error, and therefore, a lower bound on the probability of correct
detection of symbol $s_k$. The resulting lower bound, for $s_1$, is shown in figure 7.8,
along with an estimate of the probability of correct detection obtained using Monte
Carlo analysis.

Figure 7.6 Chebyshev lower bounds on the probability of correct detection
for symbols $s_1$, $s_2$, and $s_3$. These bounds are valid for any noise distribution
that has zero mean and covariance $\sigma^2 I$.

Figure 7.7 The Chebyshev lower bound on the probability of correct detection
of symbol 1 is equal to 0.2048 when $\sigma = 1$. This bound is achieved by
the discrete distribution illustrated in the figure. The solid circles are the
possible values of the received signal $s_1 + v$. The point in the center of the
ellipse has probability 0.2048. The five points on the boundary have a total
probability 0.7952. The ellipse is defined by $x^T P x + 2 q^T x + r = 1$, where
$P$, $q$, and $r$ are the optimal solution of the SDP (7.19).

Figure 7.8 The Chernoff lower bound (solid line) and a Monte Carlo estimate
(dashed line) of the probability of correct detection of symbol $s_1$, as
a function of $\sigma$. In this example the noise is Gaussian with zero mean and
covariance $\sigma^2 I$.


7.5  Experiment  design 

We consider the problem of estimating a vector $x \in \mathbf{R}^n$ from measurements or
experiments

\[
y_i = a_i^T x + w_i, \quad i = 1, \ldots, m,
\]

where $w_i$ is measurement noise. We assume that $w_i$ are independent Gaussian
random variables with zero mean and unit variance, and that the measurement
vectors $a_1, \ldots, a_m$ span $\mathbf{R}^n$. The maximum likelihood estimate of $x$, which is the
same as the minimum variance estimate, is given by the least-squares solution

\[
\hat{x} = \left( \sum_{i=1}^m a_i a_i^T \right)^{-1} \sum_{i=1}^m y_i a_i.
\]

The associated estimation error $e = \hat{x} - x$ has zero mean and covariance matrix

\[
E = \mathbf{E}\, e e^T = \left( \sum_{i=1}^m a_i a_i^T \right)^{-1}.
\]

The matrix $E$ characterizes the accuracy of the estimation, or the informativeness
of the experiments. For example the $\alpha$-confidence level ellipsoid for $x$ is given by

\[
\mathcal{E} = \{ z \mid (z - \hat{x})^T E^{-1} (z - \hat{x}) \leq \beta \},
\]

where $\beta$ is a constant that depends on $n$ and $\alpha$.
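A minimal NumPy sketch of the estimate and its error covariance is given below. The measurement vectors, the true parameter `x_true`, and the random seed are made up for illustration; the rows of `A` play the role of $a_i^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2, 50

A = rng.normal(size=(m, n))            # rows are the measurement vectors a_i^T (made up)
x_true = np.array([1.0, -2.0])         # hypothetical parameter to be estimated
y = A @ x_true + rng.normal(size=m)    # measurements with unit-variance Gaussian noise

M = A.T @ A                            # sum_i a_i a_i^T
x_hat = np.linalg.solve(M, A.T @ y)    # least-squares (ML) estimate
E = np.linalg.inv(M)                   # error covariance matrix

print(x_hat)
print(E)
```

Note that `E` depends only on the measurement vectors, not on the measured values `y`, which is what allows the experiment to be designed before any data are collected.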

We suppose that the vectors $a_1, \ldots, a_m$, which characterize the measurements,
can be chosen among $p$ possible test vectors $v_1, \ldots, v_p \in \mathbf{R}^n$, i.e., each $a_i$ is one of
the $v_j$. The goal of experiment design is to choose the vectors $a_i$, from among the
possible choices, so that the error covariance $E$ is small (in some sense). In other
words, each of $m$ experiments or measurements can be chosen from a fixed menu
of $p$ possible experiments; our job is to find a set of measurements that (together)
are maximally informative.

Let $m_j$ denote the number of experiments for which $a_i$ is chosen to have the
value $v_j$, so we have

\[
m_1 + \cdots + m_p = m.
\]

We can express the error covariance matrix as

\[
E = \left( \sum_{i=1}^m a_i a_i^T \right)^{-1} = \left( \sum_{j=1}^p m_j v_j v_j^T \right)^{-1}.
\]

This shows that the error covariance depends only on the numbers of each type of
experiment chosen (i.e., $m_1, \ldots, m_p$).

The basic experiment design problem is as follows. Given the menu of possible
choices for experiments, i.e., $v_1, \ldots, v_p$, and the total number $m$ of experiments to
be carried out, choose the numbers of each type of experiment, i.e., $m_1, \ldots, m_p$,
to make the error covariance $E$ small (in some sense). The variables $m_1, \ldots, m_p$
must, of course, be integers and sum to $m$, the given total number of experiments.
This leads to the optimization problem

\[
\begin{array}{ll}
\mbox{minimize (w.r.t. } \mathbf{S}^n_+ \mbox{)} & E = \left( \sum_{j=1}^p m_j v_j v_j^T \right)^{-1} \\
\mbox{subject to} & m_i \geq 0, \quad m_1 + \cdots + m_p = m \\
 & m_i \in \mathbf{Z},
\end{array}
\tag{7.23}
\]

where the variables are the integers $m_1, \ldots, m_p$.

The basic experiment design problem (7.23) is a vector optimization problem
over the positive semidefinite cone. If one experiment design results in $E$, and
another in $\tilde{E}$, with $E \preceq \tilde{E}$, then certainly the first experiment design is as good
as or better than the second. For example, the confidence ellipsoid for the first
experiment design (translated to the origin for comparison) is contained in the
confidence ellipsoid of the second. We can also say that the first experiment design
allows us to estimate $q^T x$ better (i.e., with lower variance) than the second experiment
design, for any vector $q$, since the variance of our estimate of $q^T x$ is given by
$q^T E q$ for the first experiment design and $q^T \tilde{E} q$ for the second. We will see below
several common scalarizations for the problem.

7.5.1  The  relaxed  experiment  design  problem 

The basic experiment design problem (7.23) can be a hard combinatorial problem
when $m$, the total number of experiments, is comparable to $n$, since in this case
the $m_i$ are all small integers. In the case when $m$ is large compared to $n$, however,
a good approximate solution of (7.23) can be found by ignoring, or relaxing, the
constraint that the $m_i$ are integers. Let $\lambda_i = m_i/m$, which is the fraction of
the total number of experiments for which $a_j = v_i$, or the relative frequency of
experiment $i$. We can express the error covariance in terms of $\lambda_i$ as

\[
E = \frac{1}{m} \left( \sum_{i=1}^p \lambda_i v_i v_i^T \right)^{-1}.
\tag{7.24}
\]

The vector $\lambda \in \mathbf{R}^p$ satisfies $\lambda \succeq 0$, $\mathbf{1}^T \lambda = 1$, and also, each $\lambda_i$ is an integer multiple
of $1/m$. By ignoring this last constraint, we arrive at the problem

\[
\begin{array}{ll}
\mbox{minimize (w.r.t. } \mathbf{S}^n_+ \mbox{)} & E = (1/m) \left( \sum_{i=1}^p \lambda_i v_i v_i^T \right)^{-1} \\
\mbox{subject to} & \lambda \succeq 0, \quad \mathbf{1}^T \lambda = 1,
\end{array}
\tag{7.25}
\]

with variable $\lambda \in \mathbf{R}^p$. To distinguish this from the original combinatorial experiment
design problem (7.23), we refer to it as the relaxed experiment design problem.
The relaxed experiment design problem (7.25) is a convex optimization problem,
since the objective $E$ is an $\mathbf{S}^n_+$-convex function of $\lambda$.

Several statements can be made about the relation between the (combinatorial)
experiment design problem (7.23) and the relaxed problem (7.25). Clearly
the optimal value of the relaxed problem provides a lower bound on the optimal
value of the combinatorial one, since the combinatorial problem has an additional
constraint. From a solution of the relaxed problem (7.25) we can construct a
suboptimal solution of the combinatorial problem (7.23) as follows. First, we apply
simple rounding to get

\[
m_i = \mathbf{round}(m \lambda_i), \quad i = 1, \ldots, p.
\]

Corresponding to this choice of $m_1, \ldots, m_p$ is the vector $\tilde{\lambda}$,

\[
\tilde{\lambda}_i = (1/m)\, \mathbf{round}(m \lambda_i), \quad i = 1, \ldots, p.
\]

The vector $\tilde{\lambda}$ satisfies the constraint that each entry is an integer multiple of $1/m$.
Clearly we have $|\lambda_i - \tilde{\lambda}_i| \leq 1/(2m)$, so for $m$ large, we have $\lambda \approx \tilde{\lambda}$. This implies
that the constraint $\mathbf{1}^T \tilde{\lambda} = 1$ is nearly satisfied, for large $m$, and also that the error
covariance matrices associated with $\tilde{\lambda}$ and $\lambda$ are close.
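The rounding step can be sketched as follows. The relaxed solution `lam` and the number of experiments `m` are hypothetical values chosen for the example; note that Python's built-in `round` resolves exact ties to the nearest even integer, which is one of several acceptable tie-breaking rules here.

```python
def round_design(lam, m):
    """Round a relaxed design lam (fractions summing to 1) to integer counts.

    Returns the counts m_i = round(m * lam_i) and the rounded fractions
    lam_tilde_i = m_i / m, each an integer multiple of 1/m.
    """
    counts = [round(m * l) for l in lam]
    lam_tilde = [c / m for c in counts]
    return counts, lam_tilde


lam = [0.30, 0.38, 0.32]    # hypothetical relaxed solution
m = 7                       # hypothetical total number of experiments
counts, lam_tilde = round_design(lam, m)

# Largest deviation between the relaxed and rounded fractions.
max_dev = max(abs(l - lt) for l, lt in zip(lam, lam_tilde))

print(counts, sum(counts), max_dev, 1 / (2 * m))
```

Here the counts happen to sum to $m$, but in general simple rounding only guarantees $|\lambda_i - \tilde{\lambda}_i| \leq 1/(2m)$, so the sum of the counts may differ slightly from $m$.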

We can also give an alternative interpretation of the relaxed experiment design
problem (7.25). We can interpret the vector $\lambda \in \mathbf{R}^p$ as defining a probability
distribution on the experiments $v_1, \ldots, v_p$. Our choice of $\lambda$ corresponds to a random
experiment: each experiment $a_i$ takes the form $v_j$ with probability $\lambda_j$.

In  the  rest  of  this  section,  we  consider  only  the  relaxed  experiment  design 
problem,  so  we  drop  the  qualifier  ‘relaxed’  in  our  discussion. 


7.5.2  Scalarizations 

Several scalarizations have been proposed for the experiment design problem (7.25),
which is a vector optimization problem over the positive semidefinite cone.

D-optimal design

The most widely used scalarization is called D-optimal design, in which we minimize
the determinant of the error covariance matrix $E$. This corresponds to designing
the experiment to minimize the volume of the resulting confidence ellipsoid (for
a fixed confidence level). Ignoring the constant factor $1/m$ in $E$, and taking the
logarithm of the objective, we can pose this problem as

\[
\begin{array}{ll}
\mbox{minimize} & \log \det \left( \sum_{i=1}^p \lambda_i v_i v_i^T \right)^{-1} \\
\mbox{subject to} & \lambda \succeq 0, \quad \mathbf{1}^T \lambda = 1,
\end{array}
\tag{7.26}
\]

which is a convex optimization problem.


E-optimal  design 


In E-optimal design, we minimize the norm of the error covariance matrix, i.e.,
the maximum eigenvalue of $E$. Since the diameter (twice the longest semi-axis)
of the confidence ellipsoid $\mathcal{E}$ is proportional to $\|E\|_2^{1/2}$, minimizing $\|E\|_2$ can be
interpreted geometrically as minimizing the diameter of the confidence ellipsoid.
E-optimal design can also be interpreted as minimizing the maximum variance of
$q^T e$, over all $q$ with $\|q\|_2 = 1$.

The E-optimal experiment design problem is

\[
\begin{array}{ll}
\mbox{minimize} & \left\| \left( \sum_{i=1}^p \lambda_i v_i v_i^T \right)^{-1} \right\|_2 \\
\mbox{subject to} & \lambda \succeq 0, \quad \mathbf{1}^T \lambda = 1.
\end{array}
\]

The objective is a convex function of $\lambda$, so this is a convex problem.
The E-optimal experiment design problem can be cast as an SDP

\[
\begin{array}{ll}
\mbox{maximize} & t \\
\mbox{subject to} & \sum_{i=1}^p \lambda_i v_i v_i^T \succeq t I \\
 & \lambda \succeq 0, \quad \mathbf{1}^T \lambda = 1,
\end{array}
\tag{7.27}
\]

with variables $\lambda \in \mathbf{R}^p$ and $t \in \mathbf{R}$.


A-optimal  design 

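In A-optimal experiment design, we minimize $\mathbf{tr}\, E$, the trace of the covariance
matrix. This objective is simply the mean of the norm of the error squared:

\[
\mathbf{E}\, \|e\|_2^2 = \mathbf{E}\, \mathbf{tr}(e e^T) = \mathbf{tr}\, E.
\]

The A-optimal experiment design problem is

\[
\begin{array}{ll}
\mbox{minimize} & \mathbf{tr} \left( \sum_{i=1}^p \lambda_i v_i v_i^T \right)^{-1} \\
\mbox{subject to} & \lambda \succeq 0, \quad \mathbf{1}^T \lambda = 1.
\end{array}
\tag{7.28}
\]

This, too, is a convex problem. Like the E-optimal experiment design problem, it
can be cast as an SDP:

\[
\begin{array}{ll}
\mbox{minimize} & \mathbf{1}^T u \\
\mbox{subject to} &
\begin{bmatrix} \sum_{i=1}^p \lambda_i v_i v_i^T & e_k \\ e_k^T & u_k \end{bmatrix} \succeq 0,
\quad k = 1, \ldots, n \\
 & \lambda \succeq 0, \quad \mathbf{1}^T \lambda = 1,
\end{array}
\]

where the variables are $u \in \mathbf{R}^n$ and $\lambda \in \mathbf{R}^p$, and here, $e_k$ is the $k$th unit vector.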
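The three scalarized objectives are straightforward to evaluate for a given design. The following sketch computes them with NumPy for a small, made-up set of candidate vectors and the uniform design; the function name `design_objectives` and the data are assumptions for illustration, and the constant factor $1/m$ is ignored as in (7.26).

```python
import numpy as np


def design_objectives(V, lam):
    """Evaluate the D-, E-, and A-optimal objectives for a design lam.

    V is n x p with the candidate vectors v_i as columns; lam lies on the simplex.
    """
    M = (V * lam) @ V.T                        # sum_i lam_i v_i v_i^T
    Minv = np.linalg.inv(M)
    return {
        "D": np.log(np.linalg.det(Minv)),      # log det of the inverse
        "E": np.linalg.eigvalsh(Minv).max(),   # spectral norm of the inverse
        "A": np.trace(Minv),                   # trace of the inverse
    }


# Hypothetical candidate vectors (columns) and the uniform design.
V = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
lam = np.full(3, 1 / 3)

obj = design_objectives(V, lam)
print(obj)
```

For this data the inverse of $\sum_i \lambda_i v_i v_i^T$ is $\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}$, so the objectives are $\log 3$, $3$, and $4$ respectively.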

Optimal  experiment  design  and  duality 

The Lagrange duals of the three scalarizations have an interesting geometric
meaning.

The dual of the D-optimal experiment design problem (7.26) can be expressed
as

\[
\begin{array}{ll}
\mbox{maximize} & \log \det W + n \log n \\
\mbox{subject to} & v_i^T W v_i \leq 1, \quad i = 1, \ldots, p,
\end{array}
\]

with variable $W \in \mathbf{S}^n$ and domain $\mathbf{S}^n_{++}$ (see exercise 5.10). This dual problem
has a simple interpretation: The optimal solution $W^\star$ determines the minimum
volume ellipsoid, centered at the origin, given by $\{ x \mid x^T W^\star x \leq 1 \}$, that contains
the points $v_1, \ldots, v_p$. (See also the discussion of problem (5.14) on page 222.) By
complementary slackness,

\[
\lambda_i^\star (1 - v_i^T W^\star v_i) = 0, \quad i = 1, \ldots, p,
\tag{7.29}
\]

i.e., the optimal experiment design only uses the experiments $v_i$ which lie on the
surface of the minimum volume ellipsoid.
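Complementary slackness can be observed numerically on a tiny instance. The sketch below computes a D-optimal design with a classical multiplicative update, $\lambda_i \leftarrow \lambda_i\, v_i^T M(\lambda)^{-1} v_i / n$ with $M(\lambda) = \sum_i \lambda_i v_i v_i^T$; this algorithm is a convenient stand-in and is not the method discussed in the text. The candidate vectors are made up, and the dual point is taken as $W = M(\lambda)^{-1}/n$, which is feasible at the optimum and gives a dual objective equal to the primal one.

```python
import numpy as np


def d_optimal(V, iters=500):
    """Multiplicative update lam_i <- lam_i * (v_i^T M^{-1} v_i) / n.

    V is n x p with the candidate vectors as columns. The update keeps lam on
    the simplex because sum_i lam_i v_i^T M^{-1} v_i = tr(M^{-1} M) = n.
    """
    n, p = V.shape
    lam = np.full(p, 1.0 / p)
    for _ in range(iters):
        M = (V * lam) @ V.T
        d = np.einsum("ij,ij->j", V, np.linalg.solve(M, V))   # v_i^T M^{-1} v_i
        lam = lam * d / n
    return lam


# Hypothetical candidates: two unit vectors and one point well inside the ellipsoid.
V = np.array([[1.0, 0.0, 0.3],
              [0.0, 1.0, 0.3]])
n = V.shape[0]

lam = d_optimal(V)
W = np.linalg.inv((V * lam) @ V.T) / n        # candidate dual solution
g = np.einsum("ij,jk,ki->i", V.T, W, V)       # v_i^T W v_i for each candidate
slack = lam * (1.0 - g)                       # left side of (7.29)

print(lam, g, slack)
```

The design converges to weight $1/2$ on each unit vector and zero on the interior point, for which $v_i^T W v_i < 1$; the products in (7.29) vanish for all three candidates.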

The duals of the E-optimal and A-optimal design problems can be given a
similar interpretation. The duals of problems (7.27) and (7.28) can be expressed
as

\[
\begin{array}{ll}
\mbox{maximize} & \mathbf{tr}\, W \\
\mbox{subject to} & v_i^T W v_i \leq 1, \quad i = 1, \ldots, p \\
 & W \succeq 0,
\end{array}
\tag{7.30}
\]

and

\[
\begin{array}{ll}
\mbox{maximize} & (\mathbf{tr}\, W^{1/2})^2 \\
\mbox{subject to} & v_i^T W v_i \leq 1, \quad i = 1, \ldots, p,
\end{array}
\tag{7.31}
\]

respectively. The variable in both problems is $W \in \mathbf{S}^n$. In the second problem
there is an implicit constraint $W \in \mathbf{S}^n_+$. (See exercises 5.40 and 5.10.)

As for the D-optimal design, the optimal solution $W^\star$ determines a minimal
ellipsoid $\{ x \mid x^T W^\star x \leq 1 \}$ that contains the points $v_1, \ldots, v_p$. Moreover $W^\star$ and
$\lambda^\star$ satisfy the complementary slackness conditions (7.29), i.e., the optimal design
only uses experiments $v_i$ that lie on the surface of the ellipsoid defined by $W^\star$.

Experiment  design  example 

We consider a problem with $x \in \mathbf{R}^2$, and $p = 20$. The 20 candidate measurement
vectors $a_i$ are shown as circles in figure 7.9. The origin is indicated with a cross.
The D-optimal experiment has only two nonzero $\lambda_i$, indicated as solid circles in
figure 7.9. The E-optimal experiment has two nonzero $\lambda_i$, indicated as solid circles
in figure 7.10. The A-optimal experiment has three nonzero $\lambda_i$, indicated as solid
circles in figure 7.11. We also show the three ellipsoids $\{ x \mid x^T W^\star x \leq 1 \}$ associated
with the dual optimal solutions $W^\star$. The resulting 90% confidence ellipsoids are
shown in figure 7.12, along with the confidence ellipsoid for the ‘uniform’ design,
with equal weight $\lambda_i = 1/p$ on all experiments.

Figure 7.9 Experiment design example. The 20 candidate measurement vectors
are indicated with circles. The D-optimal design uses the two measurement
vectors indicated with solid circles, and puts an equal weight $\lambda_i = 0.5$
on each of them. The ellipsoid is the minimum volume ellipsoid centered at
the origin, that contains the points $v_i$.

Figure 7.10 The E-optimal design uses two measurement vectors. The
dashed lines are (part of) the boundary of the ellipsoid $\{ x \mid x^T W^\star x \leq 1 \}$
where $W^\star$ is the solution of the dual problem (7.30).

Figure 7.11 The A-optimal design uses three measurement vectors, with weights
labeled in the figure as $\lambda_1 = 0.30$, $\lambda_2 = 0.38$, and $\lambda_3 = 0.32$. The
dashed line shows the ellipsoid $\{ x \mid x^T W^\star x \leq 1 \}$ associated with the solution
of the dual problem (7.31).

Figure 7.12 Shape of the 90% confidence ellipsoids for D-optimal, A-optimal,
E-optimal, and uniform designs.


7.5.3  Extensions 

Resource  limits 

Suppose that associated with each experiment is a cost $c_i$, which could represent
the economic cost, or time required, to carry out an experiment with $v_i$. The total
cost, or time required (if the experiments are carried out sequentially) is then

\[
m_1 c_1 + \cdots + m_p c_p = m c^T \lambda.
\]

We can add a limit on total cost by adding the linear inequality $m c^T \lambda \leq B$, where
$B$ is a budget, to the basic experiment design problem. We can add multiple linear
inequalities, representing limits on multiple resources.

Multiple  measurements  per  experiment 

We can also consider a generalization in which each experiment yields multiple
measurements. In other words, when we carry out an experiment using one of the
possible choices, we obtain several measurements. To model this situation we can
use the same notation as before, with $v_i$ as matrices in $\mathbf{R}^{n \times k_i}$:

$$v_i = [\, u_{i1} \ \cdots \ u_{ik_i} \,],$$

where $k_i$ is the number of (scalar) measurements obtained when the experiment $v_i$
is carried out. The error covariance matrix, in this more complicated setup, has
exactly the same form.

In conjunction with additional linear inequalities representing limits on cost or
time, we can model discounts or time savings associated with performing groups
of measurements simultaneously. Suppose, for example, that the cost of
simultaneously making (scalar) measurements $v_1$ and $v_2$ is less than the sum of the
costs of making them separately. We can take $v_3$ to be the matrix

$$v_3 = [\, v_1 \ \ v_2 \,]$$

and assign costs $c_1$, $c_2$, and $c_3$ associated with making the first measurement alone,
the second measurement alone, and the two simultaneously, respectively.

When we solve the experiment design problem, $\lambda_1$ will give us the fraction of
times we should carry out the first experiment alone, $\lambda_2$ will give us the fraction
of times we should carry out the second experiment alone, and $\lambda_3$ will give us
the fraction of times we should carry out the two experiments simultaneously.
(Normally we would expect a choice to be made here; we would not expect to have
$\lambda_1 > 0$, $\lambda_2 > 0$, and $\lambda_3 > 0$.)

Estimation

7.1 Linear measurements with exponentially distributed noise. Show how to solve the ML
estimation problem (7.2) when the noise is exponentially distributed, with density

$$p(z) = \begin{cases} (1/a) e^{-z/a} & z \ge 0 \\ 0 & z < 0, \end{cases}$$

where $a > 0$.

7.2 ML estimation and $\ell_\infty$-norm approximation. We consider the linear measurement model
$y = Ax + v$ of page 352, with a uniform noise distribution of the form

$$p(z) = \begin{cases} 1/(2\alpha) & |z| \le \alpha \\ 0 & |z| > \alpha. \end{cases}$$

As mentioned in example 7.1, page 352, any $x$ that satisfies $\|Ax - y\|_\infty \le \alpha$ is a ML
estimate.

Now assume that the parameter $\alpha$ is not known, and we wish to estimate $\alpha$, along with
the parameters $x$. Show that the ML estimates of $x$ and $\alpha$ are found by solving the
$\ell_\infty$-norm approximation problem

$$\mbox{minimize} \quad \|Ax - y\|_\infty,$$

where $a_i^T$ are the rows of $A$.

7.3 Probit model. Suppose $y \in \{0, 1\}$ is a random variable given by

$$y = \begin{cases} 1 & a^T u + b + v \le 0 \\ 0 & a^T u + b + v > 0, \end{cases}$$

where the vector $u \in \mathbf{R}^n$ is a vector of explanatory variables (as in the logistic model
described on page 354), and $v$ is a zero mean unit variance Gaussian variable.

Formulate the ML estimation problem of estimating $a$ and $b$, given data consisting of
pairs $(u_i, y_i)$, $i = 1, \ldots, N$, as a convex optimization problem.

7.4 Estimation of covariance and mean of a multivariate normal distribution. We consider the
problem of estimating the covariance matrix $R$ and the mean $a$ of a Gaussian probability
density function

$$p_{R,a}(y) = (2\pi)^{-n/2} \det(R)^{-1/2} \exp(-(y - a)^T R^{-1} (y - a)/2),$$

based on $N$ independent samples $y_1, y_2, \ldots, y_N \in \mathbf{R}^n$.

(a) We first consider the estimation problem when there are no additional constraints
on $R$ and $a$. Let $\mu$ and $Y$ be the sample mean and covariance, defined as

$$\mu = \frac{1}{N} \sum_{k=1}^N y_k, \qquad Y = \frac{1}{N} \sum_{k=1}^N (y_k - \mu)(y_k - \mu)^T.$$

Show that the log-likelihood function

$$l(R, a) = -(Nn/2) \log(2\pi) - (N/2) \log\det R - (1/2) \sum_{k=1}^N (y_k - a)^T R^{-1} (y_k - a)$$

can be expressed as

$$l(R, a) = \frac{N}{2} \left( -n \log(2\pi) - \log\det R - \mathbf{tr}(R^{-1} Y) - (a - \mu)^T R^{-1} (a - \mu) \right).$$

Use this expression to show that if $Y \succ 0$, the ML estimates of $R$ and $a$ are unique,
and given by

$$a_{\mathrm{ml}} = \mu, \qquad R_{\mathrm{ml}} = Y.$$

(b) The log-likelihood function includes a convex term ($-\log\det R$), so it is not
obviously concave. Show that $l$ is concave, jointly in $R$ and $a$, in the region defined
by

$$R \preceq 2Y.$$

This means we can use convex optimization to compute simultaneous ML estimates
of $R$ and $a$, subject to convex constraints, as long as the constraints include $R \preceq 2Y$,
i.e., the estimate $R$ must not exceed twice the unconstrained ML estimate.


7.5 Markov chain estimation. Consider a Markov chain with $n$ states, and transition
probability matrix $P \in \mathbf{R}^{n \times n}$ defined as

$$P_{ij} = \mathbf{prob}(y(t+1) = i \mid y(t) = j).$$

The transition probabilities must satisfy $P_{ij} \ge 0$ and $\sum_{i=1}^n P_{ij} = 1$, $j = 1, \ldots, n$. We
consider the problem of estimating the transition probabilities, given an observed sample
sequence $y(1) = k_1$, $y(2) = k_2$, $\ldots$, $y(N) = k_N$.

(a) Show that if there are no other prior constraints on $P_{ij}$, then the ML estimates are
the empirical transition frequencies: $\hat{P}_{ij}$ is the number of times the state
transitioned from $j$ into $i$, divided by the number of times it was $j$, in the observed
sample.

(b) Suppose that an equilibrium distribution $q$ of the Markov chain is known, i.e., a
vector $q \in \mathbf{R}^n_+$ satisfying $\mathbf{1}^T q = 1$ and $Pq = q$. Show that the problem of computing
the ML estimate of $P$, given the observed sequence and knowledge of $q$, can be
expressed as a convex optimization problem.

7.6 Estimation of mean and variance. Consider a random variable $x \in \mathbf{R}$ with density $p$,
which is normalized, i.e., has zero mean and unit variance. Consider a random variable
$y = (x+b)/a$ obtained by an affine transformation of $x$, where $a > 0$. The random variable
$y$ has mean $b$ and variance $1/a^2$. As $a$ and $b$ vary over $\mathbf{R}_+$ and $\mathbf{R}$, respectively, we generate
a family of densities obtained from $p$ by scaling and shifting, uniquely parametrized by
mean and variance.

Show that if $p$ is log-concave, then finding the ML estimate of $a$ and $b$, given samples
$y_1, \ldots, y_n$ of $y$, is a convex problem.

As an example, work out an analytical solution for the ML estimates of $a$ and $b$, assuming
$p$ is a normalized Laplacian density, $p(x) = e^{-2|x|}$.

7.7 ML estimation of Poisson distributions. Suppose $x_i$, $i = 1, \ldots, n$, are independent random
variables with Poisson distributions

$$\mathbf{prob}(x_i = k) = \frac{e^{-\mu_i} \mu_i^k}{k!},$$

with unknown means $\mu_i$. The variables $x_i$ represent the number of times that one of $n$
possible independent events occurs during a certain period. In emission tomography, for
example, they might represent the number of photons emitted by $n$ sources.

We consider an experiment designed to determine the means $\mu_i$. The experiment involves
$m$ detectors. If event $i$ occurs, it is detected by detector $j$ with probability $p_{ji}$. We assume
the probabilities $p_{ji}$ are given (with $p_{ji} \ge 0$, $\sum_{j=1}^m p_{ji} \le 1$). The total number of events
recorded by detector $j$ is denoted $y_j$,

$$y_j = \sum_{i=1}^n y_{ji}, \quad j = 1, \ldots, m.$$

Formulate the ML estimation problem of estimating the means $\mu_i$, based on observed
values of $y_j$, $j = 1, \ldots, m$, as a convex optimization problem.

Hint. The variables $y_{ji}$ have Poisson distributions with means $p_{ji} \mu_i$, i.e.,

$$\mathbf{prob}(y_{ji} = k) = \frac{e^{-p_{ji} \mu_i} (p_{ji} \mu_i)^k}{k!}.$$

The sum of $n$ independent Poisson variables with means $\lambda_1, \ldots, \lambda_n$ has a Poisson
distribution with mean $\lambda_1 + \cdots + \lambda_n$.

7.8 Estimation using sign measurements. We consider the measurement setup

$$y_i = \mathbf{sign}(a_i^T x + b_i + v_i), \quad i = 1, \ldots, m,$$

where $x \in \mathbf{R}^n$ is the vector to be estimated, and $y_i \in \{-1, 1\}$ are the measurements. The
vectors $a_i \in \mathbf{R}^n$ and scalars $b_i \in \mathbf{R}$ are known, and $v_i$ are IID noises with a log-concave
probability density. (You can assume that $a_i^T x + b_i + v_i = 0$ does not occur.) Show that
maximum likelihood estimation of $x$ is a convex optimization problem.

7.9 Estimation with unknown sensor nonlinearity. We consider the measurement setup

$$y_i = f(a_i^T x + b_i + v_i), \quad i = 1, \ldots, m,$$

where $x \in \mathbf{R}^n$ is the vector to be estimated, $y_i \in \mathbf{R}$ are the measurements, $a_i \in \mathbf{R}^n$,
$b_i \in \mathbf{R}$ are known, and $v_i$ are IID noises with log-concave probability density. The function
$f : \mathbf{R} \to \mathbf{R}$, which represents a measurement nonlinearity, is not known. However, it is
known that $f'(t) \in [l, u]$ for all $t$, where $0 < l < u$ are given.

Explain how to use convex optimization to find a maximum likelihood estimate of $x$, as
well as the function $f$. (This is an infinite-dimensional ML estimation problem, but you
can be informal in your approach and explanation.)

7.10 Nonparametric distributions on $\mathbf{R}^k$. We consider a random variable $X \in \mathbf{R}^k$ with values
in a finite set $\{\alpha_1, \ldots, \alpha_n\}$, and with distribution

$$p_i = \mathbf{prob}(X = \alpha_i), \quad i = 1, \ldots, n.$$

Show that a lower bound on the covariance of $X$,

$$S \preceq \mathbf{E}(X - \mathbf{E} X)(X - \mathbf{E} X)^T,$$

is a convex constraint in $p$.

Optimal  detector  design 

7.11 Randomized detectors. Show that every randomized detector can be expressed as a convex
combination of a set of deterministic detectors: If

$$T = [\, t_1 \ t_2 \ \cdots \ t_n \,] \in \mathbf{R}^{m \times n}$$

satisfies $t_k \succeq 0$ and $\mathbf{1}^T t_k = 1$, then $T$ can be expressed as

$$T = \theta_1 T_1 + \cdots + \theta_N T_N,$$

where $T_i$ is a zero-one matrix with exactly one element equal to one per column, and
$\theta_i \ge 0$, $\sum_{i=1}^N \theta_i = 1$. What is the maximum number of deterministic detectors $N$ we may
need?

We can interpret this convex decomposition as follows. The randomized detector can be
realized as a bank of $N$ deterministic detectors. When we observe $X = k$, the estimator
chooses a random index $j$ from the set $\{1, \ldots, N\}$, with probability $\mathbf{prob}(j = i) = \theta_i$, and
then uses deterministic detector $T_j$.

7.12 Optimal action. In detector design, we are given a matrix $P \in \mathbf{R}^{n \times m}$ (whose columns
are probability distributions), and then design a matrix $T \in \mathbf{R}^{m \times n}$ (whose columns are
probability distributions), so that $D = TP$ has large diagonal elements (and small
off-diagonal elements). In this problem we study the dual problem: Given $P$, find a matrix
$S \in \mathbf{R}^{m \times n}$ (whose columns are probability distributions), so that $D = PS \in \mathbf{R}^{n \times n}$ has
large diagonal elements (and small off-diagonal elements). To make the problem specific,
we take the objective to be maximizing the minimum element of $D$ on the diagonal.

We can interpret this problem as follows. There are $n$ outcomes, which depend
(stochastically) on which of $m$ inputs or actions we take: $P_{ij}$ is the probability that outcome $i$
occurs, given action $j$. Our goal is to find a (randomized) strategy that, to the extent
possible, causes any specified outcome to occur. The strategy is given by the matrix $S$: $S_{ji}$
is the probability that we take action $j$, when we want outcome $i$ to occur. The matrix
$D$ gives the action error probability matrix: $D_{ij}$ is the probability that outcome $i$ occurs,
when we want outcome $j$ to occur. In particular, $D_{ii}$ is the probability that outcome $i$
occurs, when we want it to occur.

Show that this problem has a simple analytical solution. Show that (unlike the
corresponding detector problem) there is always an optimal solution that is deterministic.
Hint. Show that the problem is separable in the columns of $S$.

Chebyshev  and  Chernoff  bounds 

7.13 Chebyshev-type inequalities on a finite set. Assume $X$ is a random variable taking values
in the set $\{\alpha_1, \alpha_2, \ldots, \alpha_m\}$, and let $S$ be a subset of $\{\alpha_1, \ldots, \alpha_m\}$. The distribution of $X$
is unknown, but we are given the expected values of $n$ functions $f_i$:

$$\mathbf{E} f_i(X) = b_i, \quad i = 1, \ldots, n. \tag{7.32}$$

Show that the optimal value of the LP

$$\begin{array}{ll}
\mbox{minimize} & x_0 + \sum_{i=1}^n b_i x_i \\
\mbox{subject to} & x_0 + \sum_{i=1}^n f_i(\alpha) x_i \ge 1, \quad \alpha \in S \\
& x_0 + \sum_{i=1}^n f_i(\alpha) x_i \ge 0, \quad \alpha \notin S,
\end{array}$$

with variables $x_0, \ldots, x_n$, is an upper bound on $\mathbf{prob}(X \in S)$, valid for all distributions
that satisfy (7.32). Show that there always exists a distribution that achieves the upper
bound.


Chapter  8 

Geometric  problems 


8.1  Projection  on  a  set 

The distance of a point $x_0 \in \mathbf{R}^n$ to a closed set $C \subseteq \mathbf{R}^n$, in the norm $\|\cdot\|$, is
defined as

$$\mathbf{dist}(x_0, C) = \inf\{\|x_0 - x\| \mid x \in C\}.$$

The infimum here is always achieved. We refer to any point $z \in C$ which is closest
to $x_0$, i.e., satisfies $\|z - x_0\| = \mathbf{dist}(x_0, C)$, as a projection of $x_0$ on $C$. In general
there can be more than one projection of $x_0$ on $C$, i.e., several points in $C$ closest
to $x_0$.

In some special cases we can establish that the projection of a point on a set
is unique. For example, if $C$ is closed and convex, and the norm is strictly convex
(e.g., the Euclidean norm), then for any $x_0$ there is always exactly one $z \in C$ which
is closest to $x_0$. As an interesting converse, we have the following result: If for every
$x_0$ there is a unique Euclidean projection of $x_0$ on $C$, then $C$ is closed and convex
(see exercise 8.2).

We use the notation $P_C : \mathbf{R}^n \to \mathbf{R}^n$ to denote any function for which $P_C(x_0)$
is a projection of $x_0$ on $C$, i.e., for all $x_0$,

$$P_C(x_0) \in C, \qquad \|x_0 - P_C(x_0)\| = \mathbf{dist}(x_0, C).$$

In other words, we have

$$P_C(x_0) = \mathop{\rm argmin}\{\|x - x_0\| \mid x \in C\}.$$

We refer to $P_C$ as projection on $C$.


Example 8.1 Projection on the unit square in $\mathbf{R}^2$. Consider the (boundary of the)
unit square in $\mathbf{R}^2$, i.e., $C = \{x \in \mathbf{R}^2 \mid \|x\|_\infty = 1\}$. We take $x_0 = 0$.

In the $\ell_1$-norm, the four points $(1, 0)$, $(0, -1)$, $(-1, 0)$, and $(0, 1)$ are closest to $x_0 = 0$,
with distance 1, so we have $\mathbf{dist}(x_0, C) = 1$ in the $\ell_1$-norm. The same statement holds
for the $\ell_2$-norm.

In the $\ell_\infty$-norm, all points in $C$ lie at a distance 1 from $x_0$, and $\mathbf{dist}(x_0, C) = 1$.

Example 8.2 Projection onto rank-$k$ matrices. Consider the set of $m \times n$ matrices
with rank less than or equal to $k$,

$$C = \{X \in \mathbf{R}^{m \times n} \mid \mathbf{rank}\, X \le k\},$$

with $k \le \min\{m, n\}$, and let $X_0 \in \mathbf{R}^{m \times n}$. We can find a projection of $X_0$ on
$C$, in the (spectral or maximum singular value) norm $\|\cdot\|_2$, via the singular value
decomposition. Let

$$X_0 = \sum_{i=1}^r \sigma_i u_i v_i^T$$

be the singular value decomposition of $X_0$, where $r = \mathbf{rank}\, X_0$. Then the matrix
$Y = \sum_{i=1}^{\min\{k,r\}} \sigma_i u_i v_i^T$ is a projection of $X_0$ on $C$.
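The truncated SVD of example 8.2 can be sketched numerically as follows. This is a minimal illustration, assuming NumPy; the function name `project_rank_k` and the test matrix are hypothetical choices made for this sketch, not part of the text.

```python
import numpy as np

def project_rank_k(X0, k):
    """Return one spectral-norm projection of X0 onto {X : rank X <= k}."""
    # Thin SVD; NumPy returns the singular values in decreasing order.
    U, s, Vt = np.linalg.svd(X0, full_matrices=False)
    keep = min(k, len(s))
    # Sum of the leading terms sigma_i * u_i * v_i^T.
    return (U[:, :keep] * s[:keep]) @ Vt[:keep, :]

# A 4 x 3 matrix with singular values 3, 2, 1.
X0 = np.array([[3.0, 0.0, 0.0],
               [0.0, 2.0, 0.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 0.0]])
Y = project_rank_k(X0, 2)

# Spectral-norm distance from X0 to its rank-2 truncation.
dist = np.linalg.norm(X0 - Y, 2)
```

In this instance the spectral-norm distance equals the largest discarded singular value, which is 1.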


8.1.1  Projecting  a  point  on  a  convex  set 

If $C$ is convex, then we can compute the projection $P_C(x_0)$ and the distance
$\mathbf{dist}(x_0, C)$ by solving a convex optimization problem. We represent the set $C$
by a set of linear equalities and convex inequalities

$$Ax = b, \qquad f_i(x) \le 0, \quad i = 1, \ldots, m, \tag{8.1}$$

and find the projection of $x_0$ on $C$ by solving the problem

$$\begin{array}{ll}
\mbox{minimize} & \|x - x_0\| \\
\mbox{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& Ax = b,
\end{array} \tag{8.2}$$

with variable $x$. This problem is feasible if and only if $C$ is nonempty; when it is
feasible, its optimal value is $\mathbf{dist}(x_0, C)$, and any optimal point is a projection of
$x_0$ on $C$.


Euclidean  projection  on  a  polyhedron 

The projection of $x_0$ on a polyhedron described by linear inequalities $Ax \preceq b$ can
be computed by solving the QP

$$\begin{array}{ll}
\mbox{minimize} & \|x - x_0\|_2^2 \\
\mbox{subject to} & Ax \preceq b.
\end{array}$$

Some special cases have simple analytical solutions.

• The Euclidean projection of $x_0$ on a hyperplane $C = \{x \mid a^T x = b\}$ is given
by

$$P_C(x_0) = x_0 + (b - a^T x_0) a / \|a\|_2^2.$$

• The Euclidean projection of $x_0$ on a halfspace $C = \{x \mid a^T x \le b\}$ is given by

$$P_C(x_0) = \begin{cases} x_0 + (b - a^T x_0) a / \|a\|_2^2 & a^T x_0 > b \\ x_0 & a^T x_0 \le b. \end{cases}$$

• The Euclidean projection of $x_0$ on a rectangle $C = \{x \mid l \preceq x \preceq u\}$ (where
$l \prec u$) is given by

$$P_C(x_0)_k = \begin{cases} l_k & x_{0k} \le l_k \\ x_{0k} & l_k \le x_{0k} \le u_k \\ u_k & x_{0k} \ge u_k. \end{cases}$$

Euclidean  projection  on  a  proper  cone 

Let $x = P_K(x_0)$ denote the Euclidean projection of a point $x_0$ on a proper cone $K$.
The KKT conditions of

$$\begin{array}{ll}
\mbox{minimize} & \|x - x_0\|_2^2 \\
\mbox{subject to} & x \succeq_K 0
\end{array}$$

are given by

$$x \succeq_K 0, \qquad x - x_0 = z, \qquad z \succeq_{K^*} 0, \qquad z^T x = 0.$$

Introducing the notation $x_+ = x$ and $x_- = z$, we can express these conditions as

$$x_0 = x_+ - x_-, \qquad x_+ \succeq_K 0, \qquad x_- \succeq_{K^*} 0, \qquad x_+^T x_- = 0.$$

In other words, by projecting $x_0$ on the cone $K$, we decompose it into the difference
of two orthogonal elements: one nonnegative with respect to $K$ (and which is the
projection of $x_0$ on $K$), and the other nonnegative with respect to $K^*$.

Some specific examples:

• For $K = \mathbf{R}^n_+$, we have $P_K(x_0)_k = \max\{x_{0k}, 0\}$. The Euclidean projection
of a vector onto the nonnegative orthant is found by replacing each negative
component with 0.

• For $K = \mathbf{S}^n_+$, and the Euclidean (or Frobenius) norm $\|\cdot\|_F$, we have
$P_K(X_0) = \sum_{i=1}^n \max\{0, \lambda_i\} v_i v_i^T$, where $X_0 = \sum_{i=1}^n \lambda_i v_i v_i^T$ is the eigenvalue
decomposition of $X_0$. To project a symmetric matrix onto the positive semidefinite cone,
we form its eigenvalue expansion and drop terms associated with negative
eigenvalues. This matrix is also the projection onto the positive semidefinite
cone in the $\ell_2$-, or spectral norm.
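The two cone projections above, and the decomposition $x_0 = x_+ - x_-$ with $x_+^T x_- = 0$, can be checked numerically. The sketch below assumes NumPy; the function names and test data are hypothetical. Both cones used here are self-dual, so $x_-$ should lie in the same cone as $x_+$.

```python
import numpy as np

def project_orthant(x0):
    """Euclidean projection onto the nonnegative orthant."""
    return np.maximum(x0, 0.0)

def project_psd(X0):
    """Frobenius-norm projection of a symmetric matrix onto the PSD cone."""
    lam, V = np.linalg.eigh(X0)          # X0 = V diag(lam) V^T
    # Scale column i of V by max{0, lam_i}, then recombine.
    return (V * np.maximum(lam, 0.0)) @ V.T

# Nonnegative orthant: x0 = x_plus - x_minus with orthogonal parts.
x0 = np.array([1.0, -2.0, 3.0])
x_plus = project_orthant(x0)
x_minus = x_plus - x0

# PSD cone: this X0 has eigenvalues 3 and -1.
X0 = np.array([[1.0, 2.0],
               [2.0, 1.0]])
X_plus = project_psd(X0)
X_minus = X_plus - X0
inner = np.trace(X_plus @ X_minus)       # matrix inner product tr(X_plus X_minus)
```

Here `X_minus` is itself positive semidefinite and `inner` is zero up to rounding, matching the KKT conditions.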


8.1.2  Separating  a  point  and  a  convex  set 

Suppose $C$ is a closed convex set described by the equalities and inequalities (8.1).
If $x_0 \in C$, then $\mathbf{dist}(x_0, C) = 0$, and the optimal point for the problem (8.2) is
$x_0$. If $x_0 \notin C$ then $\mathbf{dist}(x_0, C) > 0$, and the optimal value of the problem (8.2) is
positive. In this case we will see that any dual optimal point provides a separating
hyperplane between the point $x_0$ and the set $C$.

Figure 8.1 A point $x_0$ and its Euclidean projection $P_C(x_0)$ on a convex set $C$.
The hyperplane midway between the two, with normal vector $P_C(x_0) - x_0$,
strictly separates the point and the set. This property does not hold for
general norms; see exercise 8.4.

The link between projecting a point on a convex set and finding a hyperplane
that separates them (when the point is not in the set) should not be surprising.
Indeed, our proof of the separating hyperplane theorem, given in §2.5.1, relies on
finding the Euclidean distance between the sets. If $P_C(x_0)$ denotes the Euclidean
projection of $x_0$ on $C$, where $x_0 \notin C$, then the hyperplane

$$(P_C(x_0) - x_0)^T (x - (1/2)(x_0 + P_C(x_0))) = 0$$

(strictly) separates $x_0$ from $C$, as illustrated in figure 8.1. In other norms, however,
the clearest link between the projection problem and the separating hyperplane
problem is via Lagrange duality.
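The midpoint construction for the Euclidean norm can be illustrated on a small case. The sketch below assumes NumPy and takes $C$ to be the unit box $[0,1]^2$, whose Euclidean projection is componentwise clipping; the function name and data are illustrative only.

```python
import itertools
import numpy as np

def midpoint_hyperplane(x0, p):
    """Return (n, c) describing the hyperplane {x | n^T x = c} midway between x0 and p."""
    n = p - x0                       # normal vector P_C(x0) - x0
    c = n @ (0.5 * (x0 + p))         # the hyperplane passes through the midpoint
    return n, c

l, u = np.zeros(2), np.ones(2)       # C = {x | 0 <= x <= 1}
x0 = np.array([2.0, 3.0])            # a point outside C
p = np.clip(x0, l, u)                # Euclidean projection of x0 on the box
n, c = midpoint_hyperplane(x0, p)

# n^T x - c is affine, so its minimum over the box is attained at a corner.
corners = [np.array(v) for v in itertools.product([0.0, 1.0], repeat=2)]
margin_x0 = n @ x0 - c                        # negative: x0 lies on one side
margin_C = min(n @ v - c for v in corners)    # positive: all of C lies on the other
```

The two margins have opposite signs and equal magnitude $\|P_C(x_0) - x_0\|_2^2 / 2$, so the separation is strict.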

We first express (8.2) as

$$\begin{array}{ll}
\mbox{minimize} & \|y\| \\
\mbox{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& Ax = b \\
& x_0 - x = y
\end{array}$$

with variables $x$ and $y$. The Lagrangian of this problem is

$$L(x, y, \lambda, \mu, \nu) = \|y\| + \sum_{i=1}^m \lambda_i f_i(x) + \nu^T (Ax - b) + \mu^T (x_0 - x - y)$$

and the dual function is

$$g(\lambda, \mu, \nu) = \begin{cases} \inf_x \left( \sum_{i=1}^m \lambda_i f_i(x) + \nu^T (Ax - b) + \mu^T (x_0 - x) \right) & \|\mu\|_* \le 1 \\ -\infty & \mbox{otherwise,} \end{cases}$$

so we obtain the dual problem

$$\begin{array}{ll}
\mbox{maximize} & \mu^T x_0 + \inf_x \left( \sum_{i=1}^m \lambda_i f_i(x) + \nu^T (Ax - b) - \mu^T x \right) \\
\mbox{subject to} & \lambda \succeq 0 \\
& \|\mu\|_* \le 1,
\end{array}$$

with variables $\lambda$, $\mu$, $\nu$. We can interpret the dual problem as follows. Suppose $\lambda$,
$\mu$, $\nu$ are dual feasible with a positive dual objective value, i.e., $\lambda \succeq 0$, $\|\mu\|_* \le 1$,
and

$$\mu^T x_0 - \mu^T x + \sum_{i=1}^m \lambda_i f_i(x) + \nu^T (Ax - b) > 0$$

for all $x$. This implies that $\mu^T x_0 > \mu^T x$ for $x \in C$, and therefore $\mu$ defines a
strictly separating hyperplane. In particular, suppose (8.2) is strictly feasible, so
strong duality holds. If $x_0 \notin C$, the optimal value is positive, and any dual optimal
solution defines a strictly separating hyperplane.

Note that this construction of a separating hyperplane, via duality, works for
any norm. In contrast, the simple construction described above only works for the
Euclidean norm.


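Separating a point from a polyhedron

The dual problem of

$$\begin{array}{ll}
\mbox{minimize} & \|y\| \\
\mbox{subject to} & Ax \preceq b \\
& x_0 - x = y
\end{array}$$

is

$$\begin{array}{ll}
\mbox{maximize} & \mu^T x_0 - b^T \lambda \\
\mbox{subject to} & A^T \lambda = \mu \\
& \|\mu\|_* \le 1 \\
& \lambda \succeq 0
\end{array}$$

which can be further simplified as

$$\begin{array}{ll}
\mbox{maximize} & (Ax_0 - b)^T \lambda \\
\mbox{subject to} & \|A^T \lambda\|_* \le 1 \\
& \lambda \succeq 0.
\end{array}$$

It is easily verified that if the dual objective is positive, then $A^T \lambda$ is the normal
vector to a separating hyperplane: If $Ax \preceq b$, then

$$(A^T \lambda)^T x = \lambda^T (Ax) \le \lambda^T b < \lambda^T A x_0,$$

so $\mu = A^T \lambda$ defines a separating hyperplane.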
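The inequality chain above holds for any dual feasible $\lambda$ with positive objective, not only an optimal one. The following sketch, assuming NumPy, the Euclidean norm (whose dual norm is again Euclidean), and nonzero rows of $A$, builds one simple feasible $\lambda$ from the most violated scaled constraint. The helper name and data are hypothetical, and this $\lambda$ is generally not dual optimal.

```python
import numpy as np

def separating_normal(A, b, x0):
    """Return (lam, mu) with lam >= 0, ||A^T lam||_2 = 1 and (A x0 - b)^T lam > 0."""
    row_norms = np.linalg.norm(A, axis=1)
    violation = (A @ x0 - b) / row_norms       # scaled constraint violations
    i = int(np.argmax(violation))
    if violation[i] <= 0:
        raise ValueError("x0 satisfies Ax <= b, so it cannot be separated")
    lam = np.zeros(A.shape[0])
    lam[i] = 1.0 / row_norms[i]                # puts all weight on constraint i
    return lam, A.T @ lam

# Unit box 0 <= x <= 1 written as Ax <= b, and a point outside it.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])
x0 = np.array([2.0, 3.0])

lam, mu = separating_normal(A, b, x0)
dual_obj = (A @ x0 - b) @ lam                  # a lower bound on dist(x0, C)

# Check mu^T x <= lam^T b < mu^T x0 at the corners of the box.
corners = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
max_over_C = (corners @ mu).max()
```

In this example the dual objective is 2, below the true Euclidean distance $\sqrt{5}$, yet $\mu = A^T \lambda$ still separates $x_0$ from the box.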


8.1.3  Projection  and  separation  via  indicator  and  support  functions 

The ideas described above in §8.1.1 and §8.1.2 can be expressed in a compact form
in terms of the indicator function $I_C$ and the support function $S_C$ of the set $C$,
defined as

$$S_C(x) = \sup_{y \in C} x^T y, \qquad I_C(x) = \begin{cases} 0 & x \in C \\ +\infty & x \notin C. \end{cases}$$

The problem of projecting $x_0$ on a closed convex set $C$ can be expressed compactly
as

$$\begin{array}{ll}
\mbox{minimize} & \|x - x_0\| \\
\mbox{subject to} & I_C(x) \le 0,
\end{array}$$

or, equivalently, as

$$\begin{array}{ll}
\mbox{minimize} & \|y\| \\
\mbox{subject to} & I_C(x) \le 0 \\
& x_0 - x = y
\end{array}$$

where the variables are $x$ and $y$. The dual function of this problem is

$$\begin{aligned}
g(z, \lambda) &= \inf_{x,y} \left( \|y\| + \lambda I_C(x) + z^T (x_0 - x - y) \right) \\
&= \begin{cases} z^T x_0 + \inf_x \left( -z^T x + I_C(x) \right) & \|z\|_* \le 1, \ \lambda \ge 0 \\ -\infty & \mbox{otherwise} \end{cases} \\
&= \begin{cases} z^T x_0 - S_C(z) & \|z\|_* \le 1, \ \lambda \ge 0 \\ -\infty & \mbox{otherwise} \end{cases}
\end{aligned}$$

so we obtain the dual problem

$$\begin{array}{ll}
\mbox{maximize} & z^T x_0 - S_C(z) \\
\mbox{subject to} & \|z\|_* \le 1.
\end{array}$$

If $z$ is dual optimal with a positive objective value, then $z^T x_0 > z^T x$ for all $x \in C$,
i.e., $z$ defines a separating hyperplane.
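As a small numerical check of this dual, consider a box $C = \{x \mid l \preceq x \preceq u\}$ with the Euclidean norm. Its support function has the closed form $S_C(z) = \sum_k \max\{l_k z_k, u_k z_k\}$, a standard fact that is an assumption of this sketch rather than something stated in the text. The sketch assumes NumPy; the names and data are illustrative.

```python
import numpy as np

def support_box(z, l, u):
    """Support function of {x | l <= x <= u}: sup of z^T x over the box."""
    return float(np.sum(np.maximum(l * z, u * z)))

def dual_objective(z, x0, l, u):
    """Dual objective z^T x0 - S_C(z); a valid bound only if ||z||_2 <= 1."""
    return float(z @ x0) - support_box(z, l, u)

l, u = np.zeros(2), np.ones(2)
x0 = np.array([2.0, 3.0])

p = np.clip(x0, l, u)                       # Euclidean projection on the box
dist = np.linalg.norm(x0 - p)               # primal optimal value

z_opt = (x0 - p) / dist                     # unit vector from the projection toward x0
z_other = np.array([0.0, 1.0])              # another dual feasible point

obj_opt = dual_objective(z_opt, x0, l, u)   # matches dist (zero duality gap)
obj_other = dual_objective(z_other, x0, l, u)   # a weaker lower bound
```

Both dual values are positive, so either $z$ gives a separating hyperplane; only `z_opt` closes the gap with the primal value $\sqrt{5}$.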


8.2  Distance  between  sets 

The distance between two sets $C$ and $D$, in a norm $\|\cdot\|$, is defined as

$$\mathbf{dist}(C, D) = \inf\{\|x - y\| \mid x \in C, \ y \in D\}.$$

The two sets $C$ and $D$ do not intersect if $\mathbf{dist}(C, D) > 0$. They intersect if
$\mathbf{dist}(C, D) = 0$ and the infimum in the definition is attained (which is the case, for
example, if the sets are closed and one of the sets is bounded).

The distance between sets can be expressed in terms of the distance between a
point and a set,

$$\mathbf{dist}(C, D) = \mathbf{dist}(0, D - C),$$

so the results of the previous section can be applied. In this section, however, we
derive results specifically for problems involving distance between sets. This allows
us to exploit the structure of the set $C - D$, and makes the interpretation easier.


8.2.1  Computing  the  distance  between  convex  sets 

Suppose $C$ and $D$ are described by two sets of convex inequalities

$$C = \{x \mid f_i(x) \le 0, \ i = 1, \ldots, m\}, \qquad D = \{x \mid g_i(x) \le 0, \ i = 1, \ldots, p\}.$$

Figure 8.2 Euclidean distance between polyhedra $C$ and $D$. The dashed line
connects the two points in $C$ and $D$, respectively, that are closest to each
other in Euclidean norm. These points can be found by solving a QP.

(We can include linear equalities, but exclude them here for simplicity.) We can
find $\mathbf{dist}(C, D)$ by solving the convex optimization problem

$$\begin{array}{ll}
\mbox{minimize} & \|x - y\| \\
\mbox{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& g_i(y) \le 0, \quad i = 1, \ldots, p.
\end{array} \tag{8.3}$$

Euclidean distance between polyhedra

Let $C$ and $D$ be two polyhedra described by the sets of linear inequalities $A_1 x \preceq b_1$
and $A_2 x \preceq b_2$, respectively. The distance between $C$ and $D$ is the distance between
the closest pair of points, one in $C$ and the other in $D$, as illustrated in figure 8.2.
The distance between them is the optimal value of the problem

$$\begin{array}{ll}
\mbox{minimize} & \|x - y\|_2 \\
\mbox{subject to} & A_1 x \preceq b_1 \\
& A_2 y \preceq b_2.
\end{array} \tag{8.4}$$

We can square the objective to obtain an equivalent QP.


8.2.2  Separating  convex  sets 


The dual of the problem (8.3) of finding the distance between two convex sets has
an interesting geometric interpretation in terms of separating hyperplanes between
the sets. We first express the problem in the following equivalent form:

$$\begin{array}{ll}
\mbox{minimize} & \|w\| \\
\mbox{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& g_i(y) \le 0, \quad i = 1, \ldots, p \\
& x - y = w.
\end{array} \tag{8.5}$$

The dual function is

$$g(\lambda, z, \mu) = \begin{cases} \inf_x \left( \sum_{i=1}^m \lambda_i f_i(x) + z^T x \right) + \inf_y \left( \sum_{i=1}^p \mu_i g_i(y) - z^T y \right) & \|z\|_* \le 1 \\ -\infty & \mbox{otherwise,} \end{cases}$$

which results in the dual problem

$$\begin{array}{ll}
\mbox{maximize} & \inf_x \left( \sum_{i=1}^m \lambda_i f_i(x) + z^T x \right) + \inf_y \left( \sum_{i=1}^p \mu_i g_i(y) - z^T y \right) \\
\mbox{subject to} & \|z\|_* \le 1 \\
& \lambda \succeq 0, \quad \mu \succeq 0.
\end{array} \tag{8.6}$$

We can interpret this geometrically as follows. If $\lambda$, $\mu$, $z$ are dual feasible with a
positive objective value, then

$$\sum_{i=1}^m \lambda_i f_i(x) + z^T x + \sum_{i=1}^p \mu_i g_i(y) - z^T y > 0$$

for all $x$ and $y$. In particular, for $x \in C$ and $y \in D$, we have $z^T x - z^T y > 0$, so we
see that $z$ defines a hyperplane that strictly separates $C$ and $D$.

Therefore, if strong duality holds between the two problems (8.5) and (8.6)
(which is the case when (8.5) is strictly feasible), we can make the following
conclusion. If the distance between the two sets is positive, then they can be strictly
separated by a hyperplane.

Separating  polyhedra 

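Applying these duality results to sets defined by linear inequalities $A_1 x \preceq b_1$ and
$A_2 x \preceq b_2$, we find the dual problem

$$\begin{array}{ll}
\mbox{maximize} & -b_1^T \lambda - b_2^T \mu \\
\mbox{subject to} & A_1^T \lambda + z = 0 \\
& A_2^T \mu - z = 0 \\
& \|z\|_* \le 1 \\
& \lambda \succeq 0, \quad \mu \succeq 0.
\end{array}$$

If $\lambda$, $\mu$, and $z$ are dual feasible, then for all $x \in C$, $y \in D$,

$$z^T x = -\lambda^T A_1 x \ge -\lambda^T b_1, \qquad z^T y = \mu^T A_2 y \le \mu^T b_2,$$

and, if the dual objective value is positive,

$$z^T x - z^T y \ge -\lambda^T b_1 - \mu^T b_2 > 0,$$

i.e., $z$ defines a separating hyperplane.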

8.2.3  Distance  and  separation  via  indicator  and  support  functions 

The ideas described above in §8.2.1 and §8.2.2 can be expressed in a compact form
using indicator and support functions. The problem of finding the distance between
two convex sets can be posed as the convex problem

$$\begin{array}{ll}
\mbox{minimize} & \|x - y\| \\
\mbox{subject to} & I_C(x) \le 0 \\
& I_D(y) \le 0,
\end{array}$$

which is equivalent to

$$\begin{array}{ll}
\mbox{minimize} & \|w\| \\
\mbox{subject to} & I_C(x) \le 0 \\
& I_D(y) \le 0 \\
& x - y = w.
\end{array}$$

The dual of this problem is

$$\begin{array}{ll}
\mbox{maximize} & -S_C(-z) - S_D(z) \\
\mbox{subject to} & \|z\|_* \le 1.
\end{array}$$

If $z$ is dual feasible with a positive objective value, then $S_D(z) < -S_C(-z)$, i.e.,

$$\sup_{x \in D} z^T x < \inf_{x \in C} z^T x.$$

In other words, $z$ defines a hyperplane that strictly separates $C$ and $D$.
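A simple instance where this dual can be evaluated in closed form is a pair of Euclidean balls. The sketch below assumes NumPy and uses the standard support function of a ball with center $c$ and radius $r$, $S(z) = z^T c + r \|z\|_2$; that formula, the function names, and the data are assumptions of this illustration, not taken from the text.

```python
import numpy as np

def support_ball(z, center, radius):
    """Support function of the Euclidean ball {x | ||x - center||_2 <= radius}."""
    return float(z @ center) + radius * float(np.linalg.norm(z))

c_C, r_C = np.array([0.0, 0.0]), 1.0     # ball C
c_D, r_D = np.array([5.0, 0.0]), 2.0     # ball D

# Unit vector pointing from the center of D toward the center of C.
z = (c_C - c_D) / np.linalg.norm(c_C - c_D)

sup_D = support_ball(z, c_D, r_D)        # sup of z^T x over D
inf_C = -support_ball(-z, c_C, r_C)      # inf of z^T x over C
dual_obj = inf_C - sup_D                 # equals -S_C(-z) - S_D(z)

# Distance between the balls, computed directly for comparison.
dist = np.linalg.norm(c_C - c_D) - r_C - r_D
```

With these values `sup_D` is below `inf_C`, so $z$ strictly separates the balls, and the dual objective agrees with the directly computed distance of 2.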


8.3  Euclidean  distance  and  angle  problems 

Suppose $a_1, \ldots, a_n$ is a set of vectors in $\mathbf{R}^n$, which we assume (for now) have known
Euclidean lengths

$$l_1 = \|a_1\|_2, \quad \ldots, \quad l_n = \|a_n\|_2.$$

We will refer to the set of vectors as a configuration, or, when they are
independent, a basis. In this section we consider optimization problems involving various
geometric properties of the configuration, such as the Euclidean distances between
pairs of the vectors, the angles between pairs of the vectors, and various geometric
measures of the conditioning of the basis.


8.3.1  Gram  matrix  and  realizability 

The lengths, distances, and angles can be expressed in terms of the Gram matrix
associated with the vectors $a_1, \ldots, a_n$, given by

$$G = A^T A, \qquad A = [\, a_1 \ \cdots \ a_n \,],$$

so that $G_{ij} = a_i^T a_j$. The diagonal entries of $G$ are given by

$$G_{ii} = l_i^2, \quad i = 1, \ldots, n,$$

which (for now) we assume are known and fixed. The distance $d_{ij}$ between $a_i$ and
$a_j$ is

$$\begin{aligned}
d_{ij} &= \|a_i - a_j\|_2 \\
&= (l_i^2 + l_j^2 - 2 a_i^T a_j)^{1/2} \\
&= (l_i^2 + l_j^2 - 2 G_{ij})^{1/2}.
\end{aligned}$$

Conversely, we can express $G_{ij}$ in terms of $d_{ij}$ as

$$G_{ij} = \frac{l_i^2 + l_j^2 - d_{ij}^2}{2},$$

which we note, for future reference, is an affine function of $d_{ij}^2$.

The correlation coefficient $\rho_{ij}$ between (nonzero) $a_i$ and $a_j$ is given by

$$\rho_{ij} = \frac{a_i^T a_j}{\|a_i\|_2 \|a_j\|_2} = \frac{G_{ij}}{l_i l_j},$$

so that $G_{ij} = l_i l_j \rho_{ij}$ is a linear function of $\rho_{ij}$. The angle $\theta_{ij}$ between (nonzero) $a_i$
and $a_j$ is given by

$$\theta_{ij} = \cos^{-1} \rho_{ij} = \cos^{-1}(G_{ij} / (l_i l_j)),$$

where we take $\cos^{-1} \rho \in [0, \pi]$. Thus, we have $G_{ij} = l_i l_j \cos \theta_{ij}$.

The lengths, distances, and angles are invariant under orthogonal
transformations: If $Q \in \mathbf{R}^{n \times n}$ is orthogonal, then the set of vectors $Qa_1, \ldots, Qa_n$ has the
same Gram matrix, and therefore the same lengths, distances, and angles.
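The formulas above for lengths, distances, correlation coefficients, and angles can be evaluated directly from a Gram matrix. The following is a minimal sketch, assuming NumPy and nonzero vectors; the configuration and the rotation angle are arbitrary illustrative choices.

```python
import numpy as np

# Columns are the vectors a_1 = (1,0,0), a_2 = (1,1,0), a_3 = (0,2,0).
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 2.0],
              [0.0, 0.0, 0.0]])
G = A.T @ A                                   # Gram matrix, G_ij = a_i^T a_j

sq_len = np.diag(G)                           # l_i^2
l = np.sqrt(sq_len)                           # lengths l_i

# d_ij = (l_i^2 + l_j^2 - 2 G_ij)^(1/2); clip tiny negatives caused by rounding.
D = np.sqrt(np.maximum(sq_len[:, None] + sq_len[None, :] - 2.0 * G, 0.0))

rho = G / np.outer(l, l)                      # correlation coefficients
theta = np.arccos(np.clip(rho, -1.0, 1.0))    # angles in [0, pi]

# An orthogonal transformation (rotation about the third axis) leaves G unchanged.
t = 0.7
Q = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0,        0.0,       1.0]])
G_rot = (Q @ A).T @ (Q @ A)
```

For this configuration $d_{12} = 1$, $d_{13} = \sqrt{5}$, $\theta_{12} = \pi/4$, and $\theta_{13} = \pi/2$, and `G_rot` agrees with `G` up to rounding.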

Realizability 

The Gram matrix $G = A^T A$ is, of course, symmetric and positive semidefinite. The
converse is a basic result of linear algebra: A matrix $G \in \mathbf{S}^n$ is the Gram matrix
of a set of vectors $a_1, \ldots, a_n$ if and only if $G \succeq 0$. When $G \succeq 0$, we can construct
a configuration with Gram matrix $G$ by finding a matrix $A$ with $A^T A = G$. One
solution of this equation is the symmetric square root $A = G^{1/2}$. When $G \succ 0$, we
can find a solution via the Cholesky factorization of $G$: If $LL^T = G$, then we can
take $A = L^T$. Moreover, we can construct all configurations with the given Gram
matrix $G$, given any one solution $A$, by orthogonal transformation: If $\tilde{A}^T \tilde{A} = G$ is
any solution, then $\tilde{A} = QA$ for some orthogonal matrix $Q$.
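Both constructions of a configuration from a Gram matrix can be sketched as follows, assuming NumPy. The function names, the tolerance `tol`, and the example matrix are hypothetical choices for this illustration.

```python
import numpy as np

def gram_factor_sqrt(G, tol=1e-10):
    """Symmetric square root A = G^(1/2), so that A^T A = G; requires G PSD."""
    lam, V = np.linalg.eigh(G)
    if lam.min() < -tol:
        raise ValueError("G is not positive semidefinite, so it is not realizable")
    # Clip tiny negative eigenvalues from rounding before taking square roots.
    return (V * np.sqrt(np.clip(lam, 0.0, None))) @ V.T

def gram_factor_cholesky(G):
    """A = L^T with L L^T = G; requires G positive definite."""
    L = np.linalg.cholesky(G)        # lower triangular factor
    return L.T

G = np.array([[4.0, 2.0],
              [2.0, 3.0]])           # positive definite
A_sqrt = gram_factor_sqrt(G)
A_chol = gram_factor_cholesky(G)

# The two solutions differ by an orthogonal transformation: A_chol = Q A_sqrt.
Q = A_chol @ np.linalg.inv(A_sqrt)
```

Both factors reproduce $G$, and `Q` satisfies $Q^T Q = I$ up to rounding. The Cholesky route applies only when $G \succ 0$, while the eigenvalue route also covers singular $G \succeq 0$.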

Thus, a set of lengths, distances, and angles (or correlation coefficients) is
realizable, i.e., those of some configuration, if and only if the associated Gram matrix
$G$ is positive semidefinite, and has diagonal elements $l_1^2, \ldots, l_n^2$.

We can use this fact to express several geometric problems as convex
optimization problems, with $G \in \mathbf{S}^n$ as the optimization variable. Realizability imposes
the constraint $G \succeq 0$ and $G_{ii} = l_i^2$, $i = 1, \ldots, n$; we list below several other convex
constraints and objectives.

Angle  and  distance  constraints 

We can fix an angle to have a certain value, $\theta_{ij} = \alpha$, via the linear equality
constraint $G_{ij} = l_i l_j \cos \alpha$. More generally, we can impose a lower and upper
bound on an angle, $\alpha \leq \theta_{ij} \leq \beta$, by the constraint

$$l_i l_j \cos \alpha \geq G_{ij} \geq l_i l_j \cos \beta,$$

which is a pair of linear inequalities on $G$. (Here we use the fact that $\cos^{-1}$ is
monotone decreasing.) We can maximize or minimize a particular angle $\theta_{ij}$, by
minimizing or maximizing $G_{ij}$ (again using monotonicity of $\cos^{-1}$).
In a similar way we can impose constraints on the distances. To require that
$d_{ij}$ lies in an interval, we use

$$\begin{array}{rcl}
d_{\min} \leq d_{ij} \leq d_{\max} & \iff & d_{\min}^2 \leq d_{ij}^2 \leq d_{\max}^2 \\
& \iff & d_{\min}^2 \leq l_i^2 + l_j^2 - 2G_{ij} \leq d_{\max}^2,
\end{array}$$


which  is  a  pair  of  linear  inequalities  on  G.  We  can  minimize  or  maximize  a  distance, 
by  minimizing  or  maximizing  its  square,  which  is  an  affine  function  of  G. 

As a simple example, suppose we are given ranges (i.e., an interval of possible
values)  for  some  of  the  angles  and  some  of  the  distances.  We  can  then  find  the 
minimum  and  maximum  possible  value  of  some  other  angle,  or  some  other  distance, 
over  all  configurations,  by  solving  two  SDPs.  We  can  reconstruct  the  two  extreme 
configurations  by  factoring  the  resulting  optimal  Gram  matrices. 


Singular  value  and  condition  number  constraints 

The singular values of $A$, $\sigma_1 \geq \cdots \geq \sigma_n$, are the square roots of the eigenvalues
$\lambda_1 \geq \cdots \geq \lambda_n$ of $G$. Therefore $\sigma_1^2$ is a convex function of $G$, and $\sigma_n^2$ is a concave
function of $G$. Thus we can impose an upper bound on the maximum singular value
of $A$, or minimize it; we can impose a lower bound on the minimum singular value,
or maximize it. The condition number of $A$, $\sigma_1/\sigma_n$, is a quasiconvex function of $G$,
so we can impose a maximum allowable value, or minimize it over all configurations
that satisfy the other geometric constraints, by quasiconvex optimization.

Roughly speaking, the constraints we can impose as convex constraints on $G$
are those that require $a_1, \ldots, a_n$ to be a well conditioned basis.


Dual  basis 


When $G \succ 0$, $a_1, \ldots, a_n$ form a basis for $\mathbf{R}^n$. The associated dual basis is
$b_1, \ldots, b_n$, where

$$b_i^T a_j = \begin{cases} 1 & i = j \\ 0 & i \neq j. \end{cases}$$

The dual basis vectors $b_1, \ldots, b_n$ are simply the rows of the matrix $A^{-1}$. As a
result, the Gram matrix associated with the dual basis is $G^{-1}$.

We can express several geometric conditions on the dual basis as convex constraints
on $G$. The (squared) lengths of the dual basis vectors,

$$\|b_i\|_2^2 = e_i^T G^{-1} e_i,$$

are convex functions of $G$, and so can be minimized. The trace of $G^{-1}$, another
convex function of $G$, gives the sum of the squares of the lengths of the dual basis
vectors (and is another measure of a well conditioned basis).
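The relation between the dual basis and $G^{-1}$ can be checked numerically on a small case. The sketch below uses a hypothetical basis of $\mathbf{R}^2$; the helper `inv2` is ours and only handles nonsingular $2 \times 2$ matrices.

```python
# Hypothetical basis in R^2: the columns of A are a_1 = (2, 0) and a_2 = (1, 1).
A = [[2.0, 1.0],
     [0.0, 1.0]]

def inv2(M):
    """Inverse of a nonsingular 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

# Gram matrix G = A^T A, with G_ij = a_i^T a_j.
G = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]

B = inv2(A)   # the rows of A^{-1} are the dual basis vectors b_1, b_2
# Gram matrix of the dual basis, (G_dual)_ij = b_i^T b_j.
G_dual = [[sum(B[i][k] * B[j][k] for k in range(2)) for j in range(2)]
          for i in range(2)]
G_inv = inv2(G)

# The squared dual lengths are the diagonal of G^{-1}; their sum is tr(G^{-1}).
trace_G_inv = G_inv[0][0] + G_inv[1][1]
```

Here `G_dual` and `G_inv` agree entry by entry, and `trace_G_inv` is the sum of the squared lengths of $b_1$ and $b_2$.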


Ellipsoid  and  simplex  volume 

The volume of the ellipsoid $\{Au \mid \|u\|_2 \leq 1\}$, which gives another measure of how
well conditioned the basis is, is given by

$$\gamma (\det(A^T A))^{1/2} = \gamma (\det G)^{1/2},$$

where $\gamma$ is the volume of the unit ball in $\mathbf{R}^n$. The log volume is therefore $\log \gamma +
(1/2) \log \det G$, which is a concave function of $G$. We can therefore maximize the
volume of the image ellipsoid, over a convex set of configurations, by maximizing
$\log \det G$.

The same holds for any set in $\mathbf{R}^n$. The volume of the image under $A$ is its
volume, multiplied by the factor $(\det G)^{1/2}$. For example, consider the image under
$A$ of the unit simplex $\mathbf{conv}\{0, e_1, \ldots, e_n\}$, i.e., the simplex $\mathbf{conv}\{0, a_1, \ldots, a_n\}$.
The volume of this simplex is given by $\gamma (\det G)^{1/2}$, where $\gamma$ is the volume of the
unit simplex in $\mathbf{R}^n$. We can maximize the volume of this simplex by maximizing
$\log \det G$.
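A quick numerical check of the simplex volume formula in $\mathbf{R}^2$ is sketched below. The two vectors are hypothetical, and we use the standard value $\gamma = 1/n!$ for the volume of the unit simplex, which the text does not state explicitly.

```python
import math

# Hypothetical vectors in R^2 (the columns of A).
a1, a2 = (2.0, 0.0), (1.0, 1.0)

# Gram matrix entries G_ij = a_i^T a_j.
G11 = a1[0] * a1[0] + a1[1] * a1[1]
G12 = a1[0] * a2[0] + a1[1] * a2[1]
G22 = a2[0] * a2[0] + a2[1] * a2[1]
det_G = G11 * G22 - G12 * G12

# The unit simplex in R^n has volume 1/n!, so gamma = 1/2 for n = 2.
area_from_gram = 0.5 * math.sqrt(det_G)

# Direct formula for the area of the triangle conv{0, a1, a2}.
area_direct = 0.5 * abs(a1[0] * a2[1] - a1[1] * a2[0])
```

Both expressions give the same area, which depends on the vectors only through $\det G$.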


8.3.2  Problems  involving  angles  only 

Suppose we only care about the angles (or correlation coefficients) between the
vectors, and do not specify the lengths or distances between them. In this case it is
intuitively clear that we can simply assume the vectors have length $l_i = 1$. This
is easily verified: The Gram matrix has the form $G = \mathbf{diag}(l) C \, \mathbf{diag}(l)$, where $l$
is the vector of lengths, and $C$ is the correlation matrix, i.e., $C_{ij} = \cos \theta_{ij}$. It
follows that if $G \succeq 0$ for any set of positive lengths, then $G \succeq 0$ for all sets of
positive lengths, and in particular, this occurs if and only if $C \succeq 0$ (which is the
same as assuming that all lengths are one). Thus, a set of angles $\theta_{ij} \in [0, \pi]$,
$i, j = 1, \ldots, n$ is realizable if and only if $C \succeq 0$, which is a linear matrix inequality
in the correlation coefficients.
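For three vectors the realizability condition can be tested by hand. The sketch below checks positive semidefiniteness of the $3 \times 3$ correlation matrix through its determinant; the function name and the test angles are hypothetical choices for illustration.

```python
import math

def angles_realizable(theta12, theta13, theta23, tol=1e-12):
    """Check C >= 0 for the 3x3 correlation matrix with C_ij = cos(theta_ij).

    A symmetric matrix is positive semidefinite iff all its principal
    minors are nonnegative. With a unit diagonal, the 1x1 minors equal 1
    and the 2x2 minors equal 1 - cos^2 >= 0, so only the determinant
    needs to be checked.
    """
    r12, r13, r23 = math.cos(theta12), math.cos(theta13), math.cos(theta23)
    det = 1.0 + 2.0 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2
    return det >= -tol

deg = math.pi / 180.0
ok = angles_realizable(30 * deg, 60 * deg, 30 * deg)    # planar configuration
bad = angles_realizable(30 * deg, 90 * deg, 30 * deg)   # theta_13 too large
```

The first set of angles is realizable (by three coplanar vectors), while the second is not: two angles of 30 degrees force the third to be at most 60 degrees.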

As  an  example,  suppose  we  are  given  lower  and  upper  bounds  on  some  of  the 
angles  (which  is  equivalent  to  imposing  lower  and  upper  bounds  on  the  correlation 
coefficients).  We  can  then  find  the  minimum  and  maximum  possible  value  of  some 
other  angle,  over  all  configurations,  by  solving  two  SDPs. 


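Example 8.3 Bounding correlation coefficients. We consider an example in $\mathbf{R}^4$, where
we are given

$$\begin{array}{ll}
0.6 \leq \rho_{12} \leq 0.9, & 0.8 \leq \rho_{13} \leq 0.9, \\
0.5 \leq \rho_{24} \leq 0.7, & -0.8 \leq \rho_{34} \leq -0.4.
\end{array} \qquad (8.7)$$

To find the minimum and maximum possible values of $\rho_{14}$, we solve the two SDPs

$$\begin{array}{ll}
\mbox{minimize/maximize} & \rho_{14} \\
\mbox{subject to} & (8.7) \\
& \begin{bmatrix}
1 & \rho_{12} & \rho_{13} & \rho_{14} \\
\rho_{12} & 1 & \rho_{23} & \rho_{24} \\
\rho_{13} & \rho_{23} & 1 & \rho_{34} \\
\rho_{14} & \rho_{24} & \rho_{34} & 1
\end{bmatrix} \succeq 0,
\end{array}$$

with variables $\rho_{12}$, $\rho_{13}$, $\rho_{14}$, $\rho_{23}$, $\rho_{24}$, $\rho_{34}$. The minimum and maximum values (to two
significant digits) are $-0.39$ and $0.23$, with corresponding correlation matrices

$$\begin{bmatrix}
1.00 & 0.60 & 0.87 & -0.39 \\
0.60 & 1.00 & 0.33 & 0.50 \\
0.87 & 0.33 & 1.00 & -0.55 \\
-0.39 & 0.50 & -0.55 & 1.00
\end{bmatrix}, \qquad
\begin{bmatrix}
1.00 & 0.71 & 0.80 & 0.23 \\
0.71 & 1.00 & 0.31 & 0.59 \\
0.80 & 0.31 & 1.00 & -0.40 \\
0.23 & 0.59 & -0.40 & 1.00
\end{bmatrix}.$$
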
8.3.3  Euclidean  distance  problems

In a Euclidean distance problem, we are concerned only with the distances between
the vectors, $d_{ij}$, and do not care about the lengths of the vectors, or about the angles
between them. These distances, of course, are invariant not only under orthogonal
transformations, but also translation: The configuration $\tilde a_1 = a_1 + b, \ldots, \tilde a_n = a_n + b$
has the same distances as the original configuration, for any $b \in \mathbf{R}^n$. In particular,
for the choice

$$b = -(1/n) \sum_{i=1}^n a_i = -(1/n) A \mathbf{1},$$

we see that $\tilde a_i$ have the same distances as the original configuration, and also satisfy
$\sum_{i=1}^n \tilde a_i = 0$. It follows that in a Euclidean distance problem, we can assume,
without any loss of generality, that the average of the vectors $a_1, \ldots, a_n$ is zero,
i.e., $A\mathbf{1} = 0$.

We can solve Euclidean distance problems by considering the lengths (which
cannot occur in the objective or constraints of a Euclidean distance problem) as
free variables in the optimization problem. Here we rely on the fact that there is
a configuration with distances $d_{ij} \geq 0$ if and only if there are lengths $l_1, \ldots, l_n$ for
which $G \succeq 0$, where $G_{ij} = (l_i^2 + l_j^2 - d_{ij}^2)/2$.

We define $z \in \mathbf{R}^n$ as $z_i = l_i^2$, and $D \in \mathbf{S}^n$ by $D_{ij} = d_{ij}^2$ (with, of course,
$D_{ii} = 0$). The condition that $G \succeq 0$ for some choice of lengths can be expressed as

$$G = (z\mathbf{1}^T + \mathbf{1}z^T - D)/2 \succeq 0 \mbox{ for some } z \succeq 0, \qquad (8.8)$$

which is an LMI in $D$ and $z$. A matrix $D \in \mathbf{S}^n$, with nonnegative elements,
zero diagonal, and which satisfies (8.8), is called a Euclidean distance matrix. A
matrix is a Euclidean distance matrix if and only if its entries are the squares
of the Euclidean distances between the vectors of some configuration. (Given a
Euclidean distance matrix $D$ and the associated length squared vector $z$, we can
reconstruct one, or all, configurations with the given pairwise distances using the
method described above.)

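The condition (8.8) turns out to be equivalent to the simpler condition that $D$
is negative semidefinite on $\mathbf{1}^\perp$, i.e.,

$$\begin{array}{rcl}
(8.8) & \iff & u^T D u \leq 0 \mbox{ for all } u \mbox{ with } \mathbf{1}^T u = 0 \\
& \iff & (I - (1/n)\mathbf{1}\mathbf{1}^T) D (I - (1/n)\mathbf{1}\mathbf{1}^T) \preceq 0.
\end{array}$$

This simple matrix inequality, along with $D_{ij} \geq 0$, $D_{ii} = 0$, is the classical
characterization of a Euclidean distance matrix. To see the equivalence, recall that we
can assume $A\mathbf{1} = 0$, which implies that $\mathbf{1}^T G \mathbf{1} = \mathbf{1}^T A^T A \mathbf{1} = 0$. It follows that
$G \succeq 0$ if and only if $G$ is positive semidefinite on $\mathbf{1}^\perp$, i.e.,

$$\begin{array}{rcl}
0 & \preceq & (I - (1/n)\mathbf{1}\mathbf{1}^T) G (I - (1/n)\mathbf{1}\mathbf{1}^T) \\
& = & (1/2)(I - (1/n)\mathbf{1}\mathbf{1}^T)(z\mathbf{1}^T + \mathbf{1}z^T - D)(I - (1/n)\mathbf{1}\mathbf{1}^T) \\
& = & -(1/2)(I - (1/n)\mathbf{1}\mathbf{1}^T) D (I - (1/n)\mathbf{1}\mathbf{1}^T),
\end{array}$$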


which  is  the  simplified  condition. 


In summary, a matrix $D \in \mathbf{S}^n$ is a Euclidean distance matrix, i.e., gives the
squared distances between a set of $n$ vectors in $\mathbf{R}^n$, if and only if

$$\begin{array}{c}
D_{ii} = 0, \quad i = 1, \ldots, n, \qquad D_{ij} \geq 0, \quad i, j = 1, \ldots, n, \\
(I - (1/n)\mathbf{1}\mathbf{1}^T) D (I - (1/n)\mathbf{1}\mathbf{1}^T) \preceq 0,
\end{array}$$

which is a set of linear equalities, linear inequalities, and a matrix inequality in
$D$. Therefore we can express any Euclidean distance problem that is convex in the
squared distances as a convex problem with variable $D \in \mathbf{S}^n$.
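The centering argument can be illustrated numerically. The sketch below builds the squared-distance matrix of a hypothetical three-point configuration and checks that $-(1/2)(I - (1/n)\mathbf{1}\mathbf{1}^T) D (I - (1/n)\mathbf{1}\mathbf{1}^T)$ coincides with the Gram matrix of the centered points, which is positive semidefinite by construction. The helper `matmul` is ours.

```python
# Hypothetical configuration: three points in R^2.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
n = len(points)

# D_ij = squared Euclidean distance between points i and j.
D = [[sum((p - q) ** 2 for p, q in zip(points[i], points[j]))
      for j in range(n)] for i in range(n)]

# Centering projector J = I - (1/n) 1 1^T.
J = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# G = -(1/2) J D J, computed from the distances alone.
JDJ = matmul(matmul(J, D), J)
G = [[-0.5 * v for v in row] for row in JDJ]

# Gram matrix of the centered points, computed directly.
centroid = [sum(p[k] for p in points) / n for k in range(2)]
centered = [[p[k] - centroid[k] for k in range(2)] for p in points]
G_direct = [[sum(u * v for u, v in zip(centered[i], centered[j]))
             for j in range(n)] for i in range(n)]
```

Factoring `G` (for example by an eigenvalue decomposition) would then recover a zero-mean configuration with the given distances.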


8.4  Extremal  volume  ellipsoids 

Suppose $C \subseteq \mathbf{R}^n$ is bounded and has nonempty interior. In this section we consider
the problems of finding the maximum volume ellipsoid that lies inside $C$, and the
minimum volume ellipsoid that covers $C$. Both problems can be formulated as
convex programming problems, but are tractable only in special cases.


8.4.1  The  Löwner-John  ellipsoid

The minimum volume ellipsoid that contains a set $C$ is called the Löwner-John
ellipsoid of the set $C$, and is denoted $\mathcal{E}_{\mathrm{lj}}$. To characterize $\mathcal{E}_{\mathrm{lj}}$, it will be convenient
to parametrize a general ellipsoid as

$$\mathcal{E} = \{v \mid \|Av + b\|_2 \leq 1\}, \qquad (8.9)$$

i.e., the inverse image of the Euclidean unit ball under an affine mapping. We can
assume without loss of generality that $A \in \mathbf{S}^n_{++}$, in which case the volume of $\mathcal{E}$ is
proportional to $\det A^{-1}$. The problem of computing the minimum volume ellipsoid
containing $C$ can be expressed as

$$\begin{array}{ll}
\mbox{minimize} & \log \det A^{-1} \\
\mbox{subject to} & \sup_{v \in C} \|Av + b\|_2 \leq 1,
\end{array} \qquad (8.10)$$

where the variables are $A \in \mathbf{S}^n$ and $b \in \mathbf{R}^n$, and there is an implicit constraint
$A \succ 0$. The objective and constraint functions are both convex in $A$ and $b$, so the
problem (8.10) is convex. Evaluating the constraint function in (8.10), however,
involves solving a convex maximization problem, and is tractable only in certain
special cases.

Minimum  volume  ellipsoid  covering  a  finite  set 

We consider the problem of finding the minimum volume ellipsoid that contains
the finite set $C = \{x_1, \ldots, x_m\} \subseteq \mathbf{R}^n$. An ellipsoid covers $C$ if and only if it
covers its convex hull, so finding the minimum volume ellipsoid that covers $C$
is the same as finding the minimum volume ellipsoid containing the polyhedron
$\mathbf{conv}\{x_1, \ldots, x_m\}$. Applying (8.10), we can write this problem as

$$\begin{array}{ll}
\mbox{minimize} & \log \det A^{-1} \\
\mbox{subject to} & \|Ax_i + b\|_2 \leq 1, \quad i = 1, \ldots, m
\end{array} \qquad (8.11)$$

where the variables are $A \in \mathbf{S}^n$ and $b \in \mathbf{R}^n$, and we have the implicit constraint
$A \succ 0$. The norm constraints $\|Ax_i + b\|_2 \leq 1$, $i = 1, \ldots, m$, are convex inequalities in the
variables $A$ and $b$. They can be replaced with the squared versions, $\|Ax_i + b\|_2^2 \leq 1$,
which are convex quadratic inequalities in $A$ and $b$.
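To make the roles of the constraint and objective in (8.11) concrete, the sketch below evaluates them for a few candidate ellipsoids in $\mathbf{R}^2$. It does not solve (8.11); the point set, the candidates, and the helper names `covers` and `objective` are all hypothetical.

```python
import math

# Hypothetical point set in R^2.
points = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]

def covers(A, b, pts, tol=1e-12):
    """True if ||A x + b||_2 <= 1 for every x in pts (A is 2x2, b in R^2)."""
    for x in pts:
        y0 = A[0][0] * x[0] + A[0][1] * x[1] + b[0]
        y1 = A[1][0] * x[0] + A[1][1] * x[1] + b[1]
        if math.hypot(y0, y1) > 1.0 + tol:
            return False
    return True

def objective(A):
    """log det A^{-1} for a 2x2 positive definite A."""
    return -math.log(A[0][0] * A[1][1] - A[0][1] * A[1][0])

b = (0.0, 0.0)
disk = [[0.5, 0.0], [0.0, 0.5]]       # ball of radius 2
ellipse = [[0.5, 0.0], [0.0, 1.0]]    # semi-axes 2 and 1
too_small = [[1.0, 0.0], [0.0, 1.0]]  # unit ball, misses the point (2, 0)
```

Both `disk` and `ellipse` are feasible for (8.11), and `ellipse` has the smaller objective value; `too_small` would have a smaller objective still but violates a norm constraint.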


Minimum  volume  ellipsoid  covering  union  of  ellipsoids 

Minimum volume covering ellipsoids can also be computed efficiently for certain
sets $C$ that are defined by quadratic inequalities. In particular, it is possible to
compute the Löwner-John ellipsoid for a union or sum of ellipsoids.

As an example, consider the problem of finding the minimum volume ellipsoid
$\mathcal{E}_{\mathrm{lj}}$ that contains the ellipsoids $\mathcal{E}_1, \ldots, \mathcal{E}_m$ (and therefore, the convex hull of
their union). The ellipsoids $\mathcal{E}_1, \ldots, \mathcal{E}_m$ will be described by (convex) quadratic
inequalities:

$$\mathcal{E}_i = \{x \mid x^T A_i x + 2 b_i^T x + c_i \leq 0\}, \quad i = 1, \ldots, m,$$

where $A_i \in \mathbf{S}^n_{++}$. We parametrize the ellipsoid $\mathcal{E}_{\mathrm{lj}}$ as

$$\begin{array}{rcl}
\mathcal{E}_{\mathrm{lj}} & = & \{x \mid \|Ax + b\|_2 \leq 1\} \\
& = & \{x \mid x^T A^T A x + 2 (A^T b)^T x + b^T b - 1 \leq 0\}
\end{array}$$

where $A \in \mathbf{S}^n$ and $b \in \mathbf{R}^n$. Now we use a result from §B.2, that $\mathcal{E}_i \subseteq \mathcal{E}_{\mathrm{lj}}$ if and
only if there exists a $\tau \geq 0$ such that

$$\begin{bmatrix}
A^2 - \tau A_i & Ab - \tau b_i \\
(Ab - \tau b_i)^T & b^T b - 1 - \tau c_i
\end{bmatrix} \preceq 0.$$


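The volume of $\mathcal{E}_{\mathrm{lj}}$ is proportional to $\det A^{-1}$, so we can find the minimum volume
ellipsoid that contains $\mathcal{E}_1, \ldots, \mathcal{E}_m$ by solving

$$\begin{array}{ll}
\mbox{minimize} & \log \det A^{-1} \\
\mbox{subject to} & \tau_1 \geq 0, \ldots, \tau_m \geq 0 \\
& \begin{bmatrix}
A^2 - \tau_i A_i & Ab - \tau_i b_i \\
(Ab - \tau_i b_i)^T & b^T b - 1 - \tau_i c_i
\end{bmatrix} \preceq 0, \quad i = 1, \ldots, m,
\end{array}$$

or, replacing the variable $b$ by $\tilde b = Ab$,

$$\begin{array}{ll}
\mbox{minimize} & \log \det A^{-1} \\
\mbox{subject to} & \tau_1 \geq 0, \ldots, \tau_m \geq 0 \\
& \begin{bmatrix}
A^2 - \tau_i A_i & \tilde b - \tau_i b_i & 0 \\
(\tilde b - \tau_i b_i)^T & -1 - \tau_i c_i & \tilde b^T \\
0 & \tilde b & -A^2
\end{bmatrix} \preceq 0, \quad i = 1, \ldots, m,
\end{array}$$

which is convex in the variables $A^2 \in \mathbf{S}^n$, $\tilde b$, $\tau_1, \ldots, \tau_m$.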
Figure 8.3 The outer ellipse is the boundary of the Löwner-John ellipsoid,
i.e., the minimum volume ellipsoid that encloses the points $x_1, \ldots, x_6$ (shown
as dots), and therefore the polyhedron $\mathcal{P} = \mathbf{conv}\{x_1, \ldots, x_6\}$. The smaller
ellipse is the boundary of the Löwner-John ellipsoid, shrunk by a factor of
$n = 2$ about its center. This ellipsoid is guaranteed to lie inside $\mathcal{P}$.


Efficiency  of  Löwner-John  ellipsoidal  approximation

Let $\mathcal{E}_{\mathrm{lj}}$ be the Löwner-John ellipsoid of the convex set $C \subseteq \mathbf{R}^n$, which is bounded
and has nonempty interior, and let $x_0$ be its center. If we shrink the Löwner-John
ellipsoid by a factor of $n$, about its center, we obtain an ellipsoid that lies inside
the set $C$:

$$x_0 + (1/n)(\mathcal{E}_{\mathrm{lj}} - x_0) \subseteq C \subseteq \mathcal{E}_{\mathrm{lj}}.$$

In other words, the Löwner-John ellipsoid approximates an arbitrary convex set,
within a factor that depends only on the dimension $n$. Figure 8.3 shows a simple
example.

The factor $1/n$ cannot be improved without additional assumptions on $C$. Any
simplex in $\mathbf{R}^n$, for example, has the property that its Löwner-John ellipsoid must
be shrunk by a factor $n$ to fit inside it (see exercise 8.13).

We will prove this efficiency result for the special case $C = \mathbf{conv}\{x_1, \ldots, x_m\}$.
We square the norm constraints in (8.11) and introduce variables $\tilde A = A^2$ and
$\tilde b = Ab$, to obtain the problem

$$\begin{array}{ll}
\mbox{minimize} & \log \det \tilde A^{-1} \\
\mbox{subject to} & x_i^T \tilde A x_i - 2 \tilde b^T x_i + \tilde b^T \tilde A^{-1} \tilde b \leq 1, \quad i = 1, \ldots, m.
\end{array} \qquad (8.12)$$

The KKT conditions for this problem are

$$\begin{array}{c}
\sum_{i=1}^m \lambda_i (x_i x_i^T - \tilde A^{-1} \tilde b \tilde b^T \tilde A^{-1}) = \tilde A^{-1}, \qquad \sum_{i=1}^m \lambda_i (x_i - \tilde A^{-1} \tilde b) = 0, \\
\lambda_i \geq 0, \qquad x_i^T \tilde A x_i - 2 \tilde b^T x_i + \tilde b^T \tilde A^{-1} \tilde b \leq 1, \quad i = 1, \ldots, m, \\
\lambda_i (1 - x_i^T \tilde A x_i + 2 \tilde b^T x_i - \tilde b^T \tilde A^{-1} \tilde b) = 0, \quad i = 1, \ldots, m.
\end{array}$$

By a suitable affine change of coordinates, we can assume that $\tilde A = I$ and $\tilde b = 0$,
i.e., the minimum volume ellipsoid is the unit ball centered at the origin. The KKT
conditions then simplify to

$$\sum_{i=1}^m \lambda_i x_i x_i^T = I, \qquad \sum_{i=1}^m \lambda_i x_i = 0, \qquad \lambda_i (1 - x_i^T x_i) = 0, \quad i = 1, \ldots, m,$$

plus the feasibility conditions $\|x_i\|_2 \leq 1$ and $\lambda_i \geq 0$. By taking the trace of
both sides of the first equation, and using complementary slackness, we also have
$\sum_{i=1}^m \lambda_i = n$.

In the new coordinates the shrunk ellipsoid is a ball with radius $1/n$, centered
at the origin. We need to show that

$$\|x\|_2 \leq 1/n \implies x \in C = \mathbf{conv}\{x_1, \ldots, x_m\}.$$

Suppose $\|x\|_2 \leq 1/n$. From the KKT conditions, we see that

$$x = \sum_{i=1}^m \lambda_i (x^T x_i) x_i = \sum_{i=1}^m \lambda_i (x^T x_i + 1/n) x_i = \sum_{i=1}^m \mu_i x_i, \qquad (8.13)$$

where $\mu_i = \lambda_i (x^T x_i + 1/n)$. From the Cauchy-Schwarz inequality, we note that

$$\mu_i = \lambda_i (x^T x_i + 1/n) \geq \lambda_i (-\|x\|_2 \|x_i\|_2 + 1/n) \geq \lambda_i (-1/n + 1/n) = 0.$$

Furthermore

$$\sum_{i=1}^m \mu_i = \sum_{i=1}^m \lambda_i (x^T x_i + 1/n) = \sum_{i=1}^m \lambda_i / n = 1.$$

This, along with (8.13), shows that $x$ is a convex combination of $x_1, \ldots, x_m$, hence
$x \in C$.
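The steps of this proof can be traced numerically on a small instance. In the sketch below, the points are the vertices of an equilateral triangle on the unit circle and the multipliers are $\lambda_i = n/m$; both are hypothetical choices for which the simplified KKT conditions hold, so the unit ball is the Löwner-John ellipsoid of the triangle. We then form the weights $\mu_i$ for a point of norm $1/n$.

```python
import math

n, m = 2, 3
# Hypothetical data: vertices of an equilateral triangle on the unit circle.
X = [(math.cos(2 * math.pi * k / m), math.sin(2 * math.pi * k / m))
     for k in range(m)]
lam = [n / m] * m   # candidate multipliers, chosen so that sum(lam) = n

# Simplified KKT conditions: sum_i lam_i x_i x_i^T = I and sum_i lam_i x_i = 0.
S = [[sum(l * x[r] * x[c] for l, x in zip(lam, X)) for c in range(n)]
     for r in range(n)]
s = [sum(l * x[r] for l, x in zip(lam, X)) for r in range(n)]

# A hypothetical point with ||x||_2 = 0.5 = 1/n.
x = (0.3, -0.4)
# Weights mu_i = lam_i (x^T x_i + 1/n) from (8.13).
mu = [l * (x[0] * xi[0] + x[1] * xi[1] + 1.0 / n) for l, xi in zip(lam, X)]
# The combination sum_i mu_i x_i should reproduce x.
combo = [sum(w * xi[r] for w, xi in zip(mu, X)) for r in range(n)]
```

The weights `mu` are nonnegative and sum to one, and `combo` reproduces `x`, so `x` is a convex combination of the vertices, as the argument predicts.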

Efficiency  of  Löwner-John  ellipsoidal  approximation  for  symmetric  sets

If the set $C$ is symmetric about a point $x_0$, then the factor $1/n$ can be tightened
to $1/\sqrt{n}$:

$$x_0 + (1/\sqrt{n})(\mathcal{E}_{\mathrm{lj}} - x_0) \subseteq C \subseteq \mathcal{E}_{\mathrm{lj}}.$$

Again, the factor $1/\sqrt{n}$ is tight. The Löwner-John ellipsoid of the cube

$$C = \{x \in \mathbf{R}^n \mid -\mathbf{1} \preceq x \preceq \mathbf{1}\}$$

is the ball with radius $\sqrt{n}$. Scaling down by $1/\sqrt{n}$ yields a ball enclosed in $C$, and
touching the boundary at $x = \pm e_i$.

Approximating  a  norm  by  a  quadratic  norm 

Let $\| \cdot \|$ be any norm on $\mathbf{R}^n$, and let $C = \{x \mid \|x\| \leq 1\}$ be its unit ball. Let
$\mathcal{E}_{\mathrm{lj}} = \{x \mid x^T A x \leq 1\}$, with $A \in \mathbf{S}^n_{++}$, be the Löwner-John ellipsoid of $C$. Since $C$
is symmetric about the origin, the result above tells us that $(1/\sqrt{n})\mathcal{E}_{\mathrm{lj}} \subseteq C \subseteq \mathcal{E}_{\mathrm{lj}}$.
Let $\| \cdot \|_{\mathrm{lj}}$ denote the quadratic norm

$$\|z\|_{\mathrm{lj}} = (z^T A z)^{1/2},$$

whose unit ball is $\mathcal{E}_{\mathrm{lj}}$. The inclusions $(1/\sqrt{n})\mathcal{E}_{\mathrm{lj}} \subseteq C \subseteq \mathcal{E}_{\mathrm{lj}}$ are equivalent to the
inequalities

$$\|z\|_{\mathrm{lj}} \leq \|z\| \leq \sqrt{n} \, \|z\|_{\mathrm{lj}}$$

for all $z \in \mathbf{R}^n$. In other words, the quadratic norm $\| \cdot \|_{\mathrm{lj}}$ approximates the norm
$\| \cdot \|$ within a factor of $\sqrt{n}$. In particular, we see that any norm on $\mathbf{R}^n$ can be
approximated within a factor of $\sqrt{n}$ by a quadratic norm.
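As a concrete instance, take $\| \cdot \|$ to be the $\ell_\infty$-norm, whose unit ball is the cube considered above. Its Löwner-John ellipsoid is the ball of radius $\sqrt{n}$, so $A = (1/n)I$ and $\|z\|_{\mathrm{lj}} = \|z\|_2/\sqrt{n}$. The sketch below checks the two inequalities on random points; the dimension, seed, and sample size are arbitrary choices.

```python
import math
import random

def lj_norm_for_linf(z):
    """Quadratic norm whose unit ball is the Lowner-John ellipsoid of the
    l_inf unit ball (the ball of radius sqrt(n)): ||z||_2 / sqrt(n)."""
    n = len(z)
    return math.sqrt(sum(v * v for v in z)) / math.sqrt(n)

random.seed(0)
n = 5
holds = True
for _ in range(1000):
    z = [random.uniform(-1.0, 1.0) for _ in range(n)]
    q = lj_norm_for_linf(z)
    linf = max(abs(v) for v in z)
    # ||z||_lj <= ||z||_inf <= sqrt(n) ||z||_lj, up to rounding.
    if not (q <= linf + 1e-12 and linf <= math.sqrt(n) * q + 1e-12):
        holds = False
```

Both bounds are attained: the lower one at $z = \mathbf{1}$ and the upper one at $z = e_i$.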


8.4.2  Maximum  volume  inscribed  ellipsoid 


We now consider the problem of finding the ellipsoid of maximum volume that lies
inside a convex set $C$, which we assume is bounded and has nonempty interior. To
formulate this problem, we parametrize the ellipsoid as the image of the unit ball
under an affine transformation, i.e., as

$$\mathcal{E} = \{Bu + d \mid \|u\|_2 \leq 1\}.$$

Again it can be assumed that $B \in \mathbf{S}^n_{++}$, so the volume is proportional to $\det B$. We
can find the maximum volume ellipsoid inside $C$ by solving the convex optimization
problem

$$\begin{array}{ll}
\mbox{maximize} & \log \det B \\
\mbox{subject to} & \sup_{\|u\|_2 \leq 1} I_C(Bu + d) \leq 0
\end{array} \qquad (8.14)$$

in the variables $B \in \mathbf{S}^n$ and $d \in \mathbf{R}^n$, with implicit constraint $B \succ 0$.


Maximum  volume  ellipsoid  in  a  polyhedron 

We consider the case where $C$ is a polyhedron described by a set of linear
inequalities:

$$C = \{x \mid a_i^T x \leq b_i, \; i = 1, \ldots, m\}.$$

To apply (8.14) we first express the constraint in a more convenient form:

$$\begin{array}{rcl}
\sup_{\|u\|_2 \leq 1} I_C(Bu + d) \leq 0 & \iff & \sup_{\|u\|_2 \leq 1} a_i^T (Bu + d) \leq b_i, \quad i = 1, \ldots, m \\
& \iff & \|Ba_i\|_2 + a_i^T d \leq b_i, \quad i = 1, \ldots, m.
\end{array}$$

We can therefore formulate (8.14) as a convex optimization problem in the variables
$B$ and $d$:

$$\begin{array}{ll}
\mbox{minimize} & \log \det B^{-1} \\
\mbox{subject to} & \|Ba_i\|_2 + a_i^T d \leq b_i, \quad i = 1, \ldots, m.
\end{array} \qquad (8.15)$$

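The equivalence between the containment condition and the norm constraints in (8.15) can be checked on a simple case. The sketch below uses a hypothetical box in $\mathbf{R}^2$ and a candidate pair $(B, d)$; it evaluates the constraint of (8.15) and cross-checks it by sampling boundary points of the ellipsoid. It does not solve (8.15).

```python
import math

# Hypothetical polyhedron: the box |x_1| <= 2, |x_2| <= 1, as a_i^T x <= b_i.
a = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
b = [2.0, 2.0, 1.0, 1.0]

B = [[2.0, 0.0], [0.0, 1.0]]   # candidate shape matrix (symmetric positive definite)
d = (0.0, 0.0)                 # candidate center

def lhs(B, d, ai):
    """||B a_i||_2 + a_i^T d, the left-hand side of the constraint in (8.15)."""
    Ba = (B[0][0] * ai[0] + B[0][1] * ai[1], B[1][0] * ai[0] + B[1][1] * ai[1])
    return math.hypot(Ba[0], Ba[1]) + ai[0] * d[0] + ai[1] * d[1]

feasible = all(lhs(B, d, ai) <= bi + 1e-12 for ai, bi in zip(a, b))

# Cross-check by sampling boundary points B u + d with ||u||_2 = 1.
inside = True
for k in range(360):
    u = (math.cos(math.radians(k)), math.sin(math.radians(k)))
    x = (B[0][0] * u[0] + B[0][1] * u[1] + d[0],
         B[1][0] * u[0] + B[1][1] * u[1] + d[1])
    if any(ai[0] * x[0] + ai[1] * x[1] > bi + 1e-12 for ai, bi in zip(a, b)):
        inside = False
```

For this candidate all four constraints hold with equality, which corresponds to the ellipse touching each face of the box.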


Maximum  volume  ellipsoid  in  an  intersection  of  ellipsoids 

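We can also find the maximum volume ellipsoid $\mathcal{E}$ that lies in the intersection of
$m$ ellipsoids $\mathcal{E}_1, \ldots, \mathcal{E}_m$. We will describe $\mathcal{E}$ as $\mathcal{E} = \{Bu + d \mid \|u\|_2 \leq 1\}$ with
$B \in \mathbf{S}^n_{++}$, and the other ellipsoids via convex quadratic inequalities,

$$\mathcal{E}_i = \{x \mid x^T A_i x + 2 b_i^T x + c_i \leq 0\}, \quad i = 1, \ldots, m,$$

where $A_i \in \mathbf{S}^n_{++}$. We first work out the condition under which $\mathcal{E} \subseteq \mathcal{E}_i$. This occurs
if and only if

$$\begin{array}{l}
\sup_{\|u\|_2 \leq 1} \left( (d + Bu)^T A_i (d + Bu) + 2 b_i^T (d + Bu) + c_i \right) \\
\quad = d^T A_i d + 2 b_i^T d + c_i + \sup_{\|u\|_2 \leq 1} \left( u^T B A_i B u + 2 (A_i d + b_i)^T B u \right) \\
\quad \leq 0.
\end{array}$$

From §B.1,

$$\sup_{\|u\|_2 \leq 1} \left( u^T B A_i B u + 2 (A_i d + b_i)^T B u \right) \leq -(d^T A_i d + 2 b_i^T d + c_i)$$

if and only if there exists a $\lambda_i \geq 0$ such that

$$\begin{bmatrix}
-\lambda_i - d^T A_i d - 2 b_i^T d - c_i & (A_i d + b_i)^T B \\
B(A_i d + b_i) & \lambda_i I - B A_i B
\end{bmatrix} \succeq 0.$$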

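The maximum volume ellipsoid contained in $\mathcal{E}_1, \ldots, \mathcal{E}_m$ can therefore be found by
solving the problem

$$\begin{array}{ll}
\mbox{minimize} & \log \det B^{-1} \\
\mbox{subject to} & \begin{bmatrix}
-\lambda_i - d^T A_i d - 2 b_i^T d - c_i & (A_i d + b_i)^T B \\
B(A_i d + b_i) & \lambda_i I - B A_i B
\end{bmatrix} \succeq 0, \quad i = 1, \ldots, m,
\end{array}$$

with variables $B \in \mathbf{S}^n$, $d \in \mathbf{R}^n$, and $\lambda \in \mathbf{R}^m$, or, equivalently,

$$\begin{array}{ll}
\mbox{minimize} & \log \det B^{-1} \\
\mbox{subject to} & \begin{bmatrix}
-\lambda_i - c_i + b_i^T A_i^{-1} b_i & 0 & (d + A_i^{-1} b_i)^T \\
0 & \lambda_i I & B \\
d + A_i^{-1} b_i & B & A_i^{-1}
\end{bmatrix} \succeq 0, \quad i = 1, \ldots, m.
\end{array}$$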


Efficiency  of  ellipsoidal  inner  approximations 

Approximation efficiency results, similar to the ones for the Löwner-John ellipsoid,
hold for the maximum volume inscribed ellipsoid. If $C \subseteq \mathbf{R}^n$ is convex, bounded,
with nonempty interior, then the maximum volume inscribed ellipsoid, expanded
by a factor of $n$ about its center, covers the set $C$. The factor $n$ can be tightened
to $\sqrt{n}$ if the set $C$ is symmetric about a point. An example is shown in figure 8.4.


8.4.3  Affine  invariance  of  extremal  volume  ellipsoids 

The Löwner-John ellipsoid and the maximum volume inscribed ellipsoid are both
affinely invariant. If $\mathcal{E}_{\mathrm{lj}}$ is the Löwner-John ellipsoid of $C$, and $T \in \mathbf{R}^{n \times n}$ is
nonsingular, then the Löwner-John ellipsoid of $TC$ is $T\mathcal{E}_{\mathrm{lj}}$. A similar result holds
for the maximum volume inscribed ellipsoid.

To establish this result, let $\mathcal{E}$ be any ellipsoid that covers $C$. Then the ellipsoid
$T\mathcal{E}$ covers $TC$. The converse is also true: Every ellipsoid that covers $TC$ has
the form $T\mathcal{E}$, where $\mathcal{E}$ is an ellipsoid that covers $C$. In other words, the relation
$\tilde{\mathcal{E}} = T\mathcal{E}$ gives a one-to-one correspondence between the ellipsoids covering $TC$ and
the ellipsoids covering $C$. Moreover, the volumes of the corresponding ellipsoids are
all related by the ratio $|\det T|$, so in particular, if $\mathcal{E}$ has minimum volume among
ellipsoids covering $C$, then $T\mathcal{E}$ has minimum volume among ellipsoids covering $TC$.

Figure 8.4 The maximum volume ellipsoid (shown shaded) inscribed in a
polyhedron $\mathcal{P}$. The outer ellipse is the boundary of the inner ellipsoid,
expanded by a factor $n = 2$ about its center. The expanded ellipsoid is
guaranteed to cover $\mathcal{P}$.


8.5  Centering 

8.5.1  Chebyshev  center 

Let $C \subseteq \mathbf{R}^n$ be bounded and have nonempty interior, and $x \in C$. The depth of a
point $x \in C$ is defined as

$$\mathbf{depth}(x, C) = \mathbf{dist}(x, \mathbf{R}^n \setminus C),$$

i.e., the distance to the closest point in the exterior of $C$. The depth gives the
radius of the largest ball, centered at $x$, that lies in $C$. A Chebyshev center of the
set $C$ is defined as any point of maximum depth in $C$:

$$x_{\mathrm{cheb}}(C) = \mathop{\rm argmax} \mathbf{depth}(x, C) = \mathop{\rm argmax} \mathbf{dist}(x, \mathbf{R}^n \setminus C).$$

A Chebyshev center is a point inside $C$ that is farthest from the exterior of $C$; it is
also the center of the largest ball that lies inside $C$. Figure 8.5 shows an example,
in which $C$ is a polyhedron, and the norm is Euclidean.

Figure 8.5 Chebyshev center of a polyhedron $C$, in the Euclidean norm. The
center $x_{\mathrm{cheb}}$ is the deepest point inside $C$, in the sense that it is farthest from
the exterior, or complement, of $C$. The center $x_{\mathrm{cheb}}$ is also the center of the
largest Euclidean ball (shown lightly shaded) that lies inside $C$.


Chebyshev  center  of  a  convex  set 

When the set $C$ is convex, the depth is a concave function for $x \in C$, so computing
the Chebyshev center is a convex optimization problem (see exercise 8.5). More
specifically, suppose $C \subseteq \mathbf{R}^n$ is defined by a set of convex inequalities:

$$C = \{x \mid f_1(x) \leq 0, \ldots, f_m(x) \leq 0\}.$$

We can find a Chebyshev center by solving the problem

$$\begin{array}{ll}
\mbox{maximize} & R \\
\mbox{subject to} & g_i(x, R) \leq 0, \quad i = 1, \ldots, m,
\end{array} \qquad (8.16)$$

where $g_i$ is defined as

$$g_i(x, R) = \sup_{\|u\| \leq 1} f_i(x + Ru).$$

Problem (8.16) is a convex optimization problem, since each function $g_i$ is the
pointwise maximum of a family of convex functions of $x$ and $R$, hence convex.
However, evaluating $g_i$ involves solving a convex maximization problem (either
numerically or analytically), which may be very hard. In practice, we can find the
Chebyshev center only in cases where the functions $g_i$ are easy to evaluate.

Chebyshev  center  of  a  polyhedron 

Suppose $C$ is defined by a set of linear inequalities $a_i^T x \leq b_i$, $i = 1, \ldots, m$. We
have

$$g_i(x, R) = \sup_{\|u\| \leq 1} a_i^T (x + Ru) - b_i = a_i^T x + R \|a_i\|_* - b_i$$

if $R \geq 0$, so the Chebyshev center can be found by solving the LP

$$\begin{array}{ll}
\mbox{maximize} & R \\
\mbox{subject to} & a_i^T x + R \|a_i\|_* \leq b_i, \quad i = 1, \ldots, m \\
& R \geq 0
\end{array}$$

with variables $x$ and $R$.
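The LP constraints can be examined on a small example without an LP solver. The sketch below assumes the Euclidean norm (which is its own dual norm) and a hypothetical right triangle, for which the largest inscribed ball is the incircle; the helper names `depth` and `lp_feasible` are ours.

```python
import math

# Hypothetical polyhedron: the triangle x_1 >= 0, x_2 >= 0, x_1 + x_2 <= 1.
a = [(-1.0, 0.0), (0.0, -1.0), (1.0, 1.0)]
b = [0.0, 0.0, 1.0]

def depth(x):
    """Euclidean depth of x: min_i (b_i - a_i^T x) / ||a_i||_2 (negative outside)."""
    return min((bi - ai[0] * x[0] - ai[1] * x[1]) / math.hypot(ai[0], ai[1])
               for ai, bi in zip(a, b))

def lp_feasible(x, R, tol=1e-12):
    """Check the LP constraints a_i^T x + R ||a_i||_2 <= b_i and R >= 0."""
    return R >= 0 and all(
        ai[0] * x[0] + ai[1] * x[1] + R * math.hypot(ai[0], ai[1]) <= bi + tol
        for ai, bi in zip(a, b))

# For a triangle the largest inscribed ball is the incircle; for this right
# triangle its radius is r = 1 - sqrt(2)/2 and its center is (r, r).
r = 1.0 - math.sqrt(2.0) / 2.0
center = (r, r)
```

At `center` all three constraints are active with `R = r`, so no larger radius is feasible at that point, and `depth(center)` equals `r`.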

Euclidean  Chebyshev  center  of  intersection  of  ellipsoids 

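Let $C$ be an intersection of $m$ ellipsoids, defined by quadratic inequalities,

$$C = \{x \mid x^T A_i x + 2 b_i^T x + c_i \leq 0, \; i = 1, \ldots, m\},$$

where $A_i \in \mathbf{S}^n_{++}$. We have

$$\begin{array}{rcl}
g_i(x, R) & = & \sup_{\|u\|_2 \leq 1} \left( (x + Ru)^T A_i (x + Ru) + 2 b_i^T (x + Ru) + c_i \right) \\
& = & x^T A_i x + 2 b_i^T x + c_i + \sup_{\|u\|_2 \leq 1} \left( R^2 u^T A_i u + 2R (A_i x + b_i)^T u \right).
\end{array}$$

From §B.1, $g_i(x, R) \leq 0$ if and only if there exists a $\lambda_i$ such that the matrix
inequality

$$\begin{bmatrix}
-x^T A_i x - 2 b_i^T x - c_i - \lambda_i & R (A_i x + b_i)^T \\
R (A_i x + b_i) & \lambda_i I - R^2 A_i
\end{bmatrix} \succeq 0 \qquad (8.17)$$

holds. Using this result, we can express the Chebyshev centering problem as

$$\begin{array}{ll}
\mbox{maximize} & R \\
\mbox{subject to} & \begin{bmatrix}
-\lambda_i - c_i + b_i^T A_i^{-1} b_i & 0 & (x + A_i^{-1} b_i)^T \\
0 & \lambda_i I & RI \\
x + A_i^{-1} b_i & RI & A_i^{-1}
\end{bmatrix} \succeq 0, \quad i = 1, \ldots, m,
\end{array}$$

which is an SDP with variables $R$, $\lambda$, and $x$. Note that the Schur complement of
$A_i^{-1}$ in the LMI constraint is equal to the lefthand side of (8.17).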


8.5.2  Maximum  volume  ellipsoid  center 

The Chebyshev center $x_{\mathrm{cheb}}$ of a set $C \subseteq \mathbf{R}^n$ is the center of the largest ball that
lies in $C$. As an extension of this idea, we define the maximum volume ellipsoid
center of $C$, denoted $x_{\mathrm{mve}}$, as the center of the maximum volume ellipsoid that lies
in $C$. Figure 8.6 shows an example, where $C$ is a polyhedron.

The maximum volume ellipsoid center is readily computed when $C$ is defined
by a set of linear inequalities, by solving the problem (8.15). (The optimal value
of the variable $d \in \mathbf{R}^n$ is $x_{\mathrm{mve}}$.) Since the maximum volume ellipsoid inside $C$ is
affine invariant, so is the maximum volume ellipsoid center.

Figure 8.6 The lightly shaded ellipsoid shows the maximum volume ellipsoid
contained in the set $C$, which is the same polyhedron as in figure 8.5. Its
center $x_{\mathrm{mve}}$ is the maximum volume ellipsoid center of $C$.


8.5.3  Analytic  center  of  a  set  of  inequalities 

The analytic center $x_{\mathrm{ac}}$ of a set of convex inequalities and linear equalities,

$$f_i(x) \leq 0, \quad i = 1, \ldots, m, \qquad Fx = g$$

is defined as an optimal point for the (convex) problem

$$\begin{array}{ll}
\mbox{minimize} & -\sum_{i=1}^m \log(-f_i(x)) \\
\mbox{subject to} & Fx = g,
\end{array} \qquad (8.18)$$

with variable $x \in \mathbf{R}^n$ and implicit constraints $f_i(x) < 0$, $i = 1, \ldots, m$. The objective
in (8.18) is called the logarithmic barrier associated with the set of inequalities.
We assume here that the domain of the logarithmic barrier intersects the affine set
defined by the equalities, i.e., the strict inequality system

$$f_i(x) < 0, \quad i = 1, \ldots, m, \qquad Fx = g$$

is feasible. The logarithmic barrier is bounded below on the feasible set

$$C = \{x \mid f_i(x) < 0, \; i = 1, \ldots, m, \; Fx = g\},$$

if $C$ is bounded.

When $x$ is strictly feasible, i.e., $Fx = g$ and $f_i(x) < 0$ for $i = 1, \ldots, m$, we can
interpret $-f_i(x)$ as the margin or slack in the $i$th inequality. The analytic center
$x_{\mathrm{ac}}$ is the point that maximizes the product (or geometric mean) of these slacks or
margins, subject to the equality constraints $Fx = g$, and the implicit constraints
$f_i(x) < 0$.

The analytic center is not a function of the set $C$ described by the inequalities
and equalities; two sets of inequalities and equalities can define the same set, but
have different analytic centers. Still, it is not uncommon to informally use the
term 'analytic center of a set $C$' to mean the analytic center of a particular set of
equalities and inequalities that define it.

The analytic center is, however, independent of affine changes of coordinates.
It is also invariant under (positive) scalings of the inequality functions, and any
reparametrization of the equality constraints. In other words, if $\tilde F$ and $\tilde g$ are such
that $\tilde F x = \tilde g$ if and only if $Fx = g$, and $\alpha_1, \ldots, \alpha_m > 0$, then the analytic center of

$$\alpha_i f_i(x) \leq 0, \quad i = 1, \ldots, m, \qquad \tilde F x = \tilde g,$$

is the same as the analytic center of

$$f_i(x) \leq 0, \quad i = 1, \ldots, m, \qquad Fx = g$$

(see exercise 8.17).

Analytic  center  of  a  set  of  linear  inequalities 

The analytic center of a set of linear inequalities

$$a_i^T x \leq b_i, \quad i = 1, \ldots, m,$$

is the solution of the unconstrained minimization problem

$$\mbox{minimize} \quad -\sum_{i=1}^m \log(b_i - a_i^T x), \qquad (8.19)$$

with implicit constraint $b_i - a_i^T x > 0$, $i = 1, \ldots, m$. If the polyhedron defined by
the linear inequalities is bounded, then the logarithmic barrier is bounded below
and strictly convex, so the analytic center is unique. (See exercise 4.2.)

We can give a geometric interpretation of the analytic center of a set of linear
inequalities. Since the analytic center is independent of positive scaling of the
constraint functions, we can assume without loss of generality that $\|a_i\|_2 = 1$. In
this case, the slack $b_i - a_i^T x$ is the distance to the hyperplane $\mathcal{H}_i = \{x \mid a_i^T x =
b_i\}$. Therefore the analytic center $x_{\mathrm{ac}}$ is the point that maximizes the product of
distances to the defining hyperplanes.
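Problem (8.19) is smooth and unconstrained, so it can be solved by Newton's method. The sketch below is one minimal way to do this in $\mathbf{R}^2$, with a backtracking line search that also keeps the iterates strictly feasible. The triangle, the starting point, the tolerance, and the line search parameter 0.25 are all hypothetical choices.

```python
import math

# Hypothetical polyhedron: the triangle x_1 >= 0, x_2 >= 0, x_1 + x_2 <= 1,
# written as a_i^T x <= b_i.
a = [(-1.0, 0.0), (0.0, -1.0), (1.0, 1.0)]
b = [0.0, 0.0, 1.0]

def slacks(x):
    return [bi - ai[0] * x[0] - ai[1] * x[1] for ai, bi in zip(a, b)]

def barrier(x):
    """Logarithmic barrier; +inf outside the strictly feasible set."""
    s = slacks(x)
    if min(s) <= 0.0:
        return math.inf
    return -sum(math.log(v) for v in s)

def analytic_center(x, tol=1e-12, max_iter=50):
    """Newton's method with backtracking, started at a strictly feasible x."""
    for _ in range(max_iter):
        d = [1.0 / s for s in slacks(x)]
        # Gradient sum_i d_i a_i and Hessian sum_i d_i^2 a_i a_i^T.
        g = [sum(di * ai[k] for di, ai in zip(d, a)) for k in range(2)]
        H = [[sum(di * di * ai[r] * ai[c] for di, ai in zip(d, a))
              for c in range(2)] for r in range(2)]
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        # Newton step: solve the 2x2 system H dx = -g.
        dx = (-(H[1][1] * g[0] - H[0][1] * g[1]) / det,
              -(H[0][0] * g[1] - H[1][0] * g[0]) / det)
        decrement_sq = -(g[0] * dx[0] + g[1] * dx[1])   # equals g^T H^{-1} g
        if decrement_sq / 2.0 <= tol:
            break
        t, f0 = 1.0, barrier(x)
        while barrier((x[0] + t * dx[0], x[1] + t * dx[1])) > f0 - 0.25 * t * decrement_sq:
            t *= 0.5
        x = (x[0] + t * dx[0], x[1] + t * dx[1])
    return x

x_ac = analytic_center((0.1, 0.2))
```

For this triangle the product of slacks $x_1 x_2 (1 - x_1 - x_2)$ is maximized at $(1/3, 1/3)$, which is the point the iteration approaches.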

Inner  and  outer  ellipsoids  from  analytic  center  of  linear  inequalities 

The analytic center of a set of linear inequalities implicitly defines an inscribed and
a covering ellipsoid, defined by the Hessian of the logarithmic barrier function

\[ -\sum_{i=1}^m \log(b_i - a_i^T x), \]

evaluated at the analytic center, i.e.,

\[ H = \sum_{i=1}^m d_i^2 a_i a_i^T, \qquad d_i = \frac{1}{b_i - a_i^T x_{\mathrm{ac}}}, \quad i = 1, \ldots, m. \]

We have $\mathcal{E}_{\mathrm{inner}} \subseteq \mathcal{P} \subseteq \mathcal{E}_{\mathrm{outer}}$, where

\[
\begin{array}{rcl}
\mathcal{P} &=& \{x \mid a_i^T x \le b_i, \; i = 1, \ldots, m\}, \\
\mathcal{E}_{\mathrm{inner}} &=& \{x \mid (x - x_{\mathrm{ac}})^T H (x - x_{\mathrm{ac}}) \le 1\}, \\
\mathcal{E}_{\mathrm{outer}} &=& \{x \mid (x - x_{\mathrm{ac}})^T H (x - x_{\mathrm{ac}}) \le m(m-1)\}.
\end{array}
\]

Figure 8.7 The dashed lines show five level curves of the logarithmic barrier
function for the inequalities defining the polyhedron $C$ in figure 8.5. The
minimizer of the logarithmic barrier function, labeled $x_{\mathrm{ac}}$, is the analytic
center of the inequalities. The inner ellipsoid
$\mathcal{E}_{\mathrm{inner}} = \{x \mid (x - x_{\mathrm{ac}})^T H (x - x_{\mathrm{ac}}) \le 1\}$,
where $H$ is the Hessian of the logarithmic barrier function at $x_{\mathrm{ac}}$,
is shaded.


This is a weaker result than the one for the maximum volume inscribed ellipsoid,
which when scaled up by a factor of $n$ covers the polyhedron. The inner and outer
ellipsoids defined by the Hessian of the logarithmic barrier, in contrast, are related
by the scale factor $(m(m-1))^{1/2}$, which is always at least $n$.

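To show that $\mathcal{E}_{\mathrm{inner}} \subseteq \mathcal{P}$, suppose $x \in \mathcal{E}_{\mathrm{inner}}$, i.e.,

\[ (x - x_{\mathrm{ac}})^T H (x - x_{\mathrm{ac}}) = \sum_{i=1}^m \left( d_i a_i^T (x - x_{\mathrm{ac}}) \right)^2 \le 1. \]

This implies that

\[ a_i^T (x - x_{\mathrm{ac}}) \le 1/d_i = b_i - a_i^T x_{\mathrm{ac}}, \quad i = 1, \ldots, m, \]

and therefore $a_i^T x \le b_i$ for $i = 1, \ldots, m$. (We have not used the fact that $x_{\mathrm{ac}}$ is
the analytic center, so this result is valid if we replace $x_{\mathrm{ac}}$ with any strictly feasible
point.)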

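To establish that $\mathcal{P} \subseteq \mathcal{E}_{\mathrm{outer}}$, we will need the fact that $x_{\mathrm{ac}}$ is the analytic
center, and therefore the gradient of the logarithmic barrier vanishes:

\[ \sum_{i=1}^m d_i a_i = 0. \]

Now assume $x \in \mathcal{P}$. Then

\[
\begin{aligned}
(x - x_{\mathrm{ac}})^T H (x - x_{\mathrm{ac}})
&= \sum_{i=1}^m \left( d_i a_i^T (x - x_{\mathrm{ac}}) \right)^2 \\
&= \sum_{i=1}^m d_i^2 \left( 1/d_i - a_i^T (x - x_{\mathrm{ac}}) \right)^2 - m \\
&= \sum_{i=1}^m d_i^2 (b_i - a_i^T x)^2 - m \\
&\le \left( \sum_{i=1}^m d_i (b_i - a_i^T x) \right)^2 - m \\
&= \left( \sum_{i=1}^m d_i (b_i - a_i^T x_{\mathrm{ac}}) + \sum_{i=1}^m d_i a_i^T (x_{\mathrm{ac}} - x) \right)^2 - m \\
&= m^2 - m,
\end{aligned}
\]

which shows that $x \in \mathcal{E}_{\mathrm{outer}}$. (The second equality follows from the fact that
$\sum_{i=1}^m d_i a_i = 0$. The inequality follows from $\sum_{i=1}^m y_i^2 \le \left( \sum_{i=1}^m y_i \right)^2$ for $y \succeq 0$. The
last equality follows from $\sum_{i=1}^m d_i a_i = 0$, and the definition of $d_i$.)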
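The two inclusions can be checked numerically on a small case. The sketch below uses an illustrative triangle whose analytic center is assumed to be $(1/3, 1/3)$; the code verifies that assumption through the vanishing-gradient condition before testing the ellipsoids. All names are hypothetical.

```python
import math

# Triangle {x | a_i^T x <= b_i}: x1 >= 0, x2 >= 0, x1 + x2 <= 1 (illustrative data).
A = [(-1.0, 0.0), (0.0, -1.0), (1.0, 1.0)]
b = [0.0, 0.0, 1.0]
m = len(A)
x_ac = (1.0 / 3.0, 1.0 / 3.0)        # candidate analytic center, checked via grad below

d = [1.0 / (bi - (ai[0] * x_ac[0] + ai[1] * x_ac[1])) for ai, bi in zip(A, b)]

# Optimality condition: the barrier gradient sum_i d_i a_i vanishes at x_ac.
grad = (sum(di * ai[0] for di, ai in zip(d, A)),
        sum(di * ai[1] for di, ai in zip(d, A)))

def quad_form(x):
    """(x - x_ac)^T H (x - x_ac), computed as sum_i (d_i a_i^T (x - x_ac))^2."""
    dx = (x[0] - x_ac[0], x[1] - x_ac[1])
    return sum((di * (ai[0] * dx[0] + ai[1] * dx[1])) ** 2 for di, ai in zip(d, A))

def in_polyhedron(x, tol=1e-9):
    return all(ai[0] * x[0] + ai[1] * x[1] <= bi + tol for ai, bi in zip(A, b))

# Outer ellipsoid: it suffices to test the vertices, since the quadratic form is convex.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
outer_values = [quad_form(v) for v in vertices]      # each should be <= m(m-1)

# Inner ellipsoid: sample its boundary and test membership in the polyhedron.
inner_ok = True
for k in range(360):
    u = (math.cos(math.radians(k)), math.sin(math.radians(k)))
    q = quad_form((x_ac[0] + u[0], x_ac[1] + u[1]))  # u^T H u
    r = 1.0 / math.sqrt(q)                           # x_ac + r u lies on the boundary
    inner_ok = inner_ok and in_polyhedron((x_ac[0] + r * u[0], x_ac[1] + r * u[1]))
```

For this triangle each vertex gives a value equal to $m(m-1) = 6$ up to rounding, so the outer bound is tight in this particular example.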

Analytic  center  of  a  linear  matrix  inequality 

The definition of analytic center can be extended to sets described by generalized
inequalities with respect to a cone $K$, if we define a logarithm on $K$. For example,
the analytic center of a linear matrix inequality

\[ x_1 A_1 + x_2 A_2 + \cdots + x_n A_n \preceq B \]

is defined as the solution of

\[ \mbox{minimize} \quad -\log\det(B - x_1 A_1 - \cdots - x_n A_n). \]


8.6  Classification 


In pattern recognition and classification problems we are given two sets of points
in $\mathbf{R}^n$, $\{x_1, \ldots, x_N\}$ and $\{y_1, \ldots, y_M\}$, and wish to find a function $f : \mathbf{R}^n \to \mathbf{R}$
(within a given family of functions) that is positive on the first set and negative on
the second, i.e.,

\[ f(x_i) > 0, \quad i = 1, \ldots, N, \qquad f(y_i) < 0, \quad i = 1, \ldots, M. \]

If these inequalities hold, we say that $f$, or its 0-level set $\{x \mid f(x) = 0\}$, separates,
classifies, or discriminates the two sets of points. We sometimes also consider weak
separation, in which the weak versions of the inequalities hold.

Figure 8.8 The points $x_1, \ldots, x_N$ are shown as open circles, and the points
$y_1, \ldots, y_M$ are shown as filled circles. These two sets are classified by an
affine function $f$, whose 0-level set (a line) separates them.


8.6.1  Linear  discrimination 

In linear discrimination, we seek an affine function $f(x) = a^T x - b$ that classifies
the points, i.e.,

\[ a^T x_i - b > 0, \quad i = 1, \ldots, N, \qquad a^T y_i - b < 0, \quad i = 1, \ldots, M. \qquad (8.20) \]

Geometrically, we seek a hyperplane that separates the two sets of points. Since
the strict inequalities (8.20) are homogeneous in $a$ and $b$, they are feasible if and
only if the set of nonstrict linear inequalities

\[ a^T x_i - b \ge 1, \quad i = 1, \ldots, N, \qquad a^T y_i - b \le -1, \quad i = 1, \ldots, M \qquad (8.21) \]

(in the variables $a$, $b$) is feasible. Figure 8.8 shows a simple example of two sets of
points and a linear discriminating function.
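The homogeneity argument can be made concrete: any $(a, b)$ satisfying the strict inequalities (8.20) can be divided by its smallest margin to obtain a pair satisfying (8.21). The helper name `normalize_separator` and the data below are illustrative assumptions.

```python
def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

def normalize_separator(a, b, X, Y):
    """Scale (a, b) satisfying the strict inequalities (8.20) so that (8.21) holds."""
    margin = min(min(dot(a, x) - b for x in X), min(b - dot(a, y) for y in Y))
    if margin <= 0:
        raise ValueError("(a, b) does not strictly separate the two sets")
    # Dividing by the smallest margin makes every inequality hold with gap >= 1.
    return [ai / margin for ai in a], b / margin

X = [(2.0, 1.0), (3.0, 0.5)]      # points where f should be positive
Y = [(0.0, 0.0), (0.5, -1.0)]     # points where f should be negative
a, b = normalize_separator([0.1, 0.0], 0.1, X, Y)
```

The scaled pair defines the same hyperplane $\{z \mid a^T z = b\}$, so nothing about the classifier changes except the normalization.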

Linear  discrimination  alternative 

The strong alternative of the set of strict inequalities (8.20) is the existence of $\lambda$,
$\tilde\lambda$ such that

\[ \lambda \succeq 0, \quad \tilde\lambda \succeq 0, \quad (\lambda, \tilde\lambda) \neq 0, \quad \sum_{i=1}^N \lambda_i x_i = \sum_{i=1}^M \tilde\lambda_i y_i, \quad \mathbf{1}^T \lambda = \mathbf{1}^T \tilde\lambda \qquad (8.22) \]

(see §5.8.3). Using the third and last conditions, we can express these alternative
conditions as

\[ \lambda \succeq 0, \quad \mathbf{1}^T \lambda = 1, \quad \tilde\lambda \succeq 0, \quad \mathbf{1}^T \tilde\lambda = 1, \quad \sum_{i=1}^N \lambda_i x_i = \sum_{i=1}^M \tilde\lambda_i y_i \]

(by dividing by $\mathbf{1}^T \lambda$, which is positive, and using the same symbols for the normalized
$\lambda$ and $\tilde\lambda$). These conditions have a simple geometric interpretation: They state
that there is a point in the convex hull of both $\{x_1, \ldots, x_N\}$ and $\{y_1, \ldots, y_M\}$. In
other words: the two sets of points can be linearly discriminated (i.e., discriminated
by an affine function) if and only if their convex hulls do not intersect. We
have seen this result several times before.


Robust  linear  discrimination 

The existence of an affine classifying function $f(x) = a^T x - b$ is equivalent to a
set of linear inequalities in the variables $a$ and $b$ that define $f$. If the two sets
can be linearly discriminated, then there is a polyhedron of affine functions that
discriminate them, and we can choose one that optimizes some measure of robustness.
We might, for example, seek the function that gives the maximum possible
‘gap’ between the (positive) values at the points $x_i$ and the (negative) values at the
points $y_i$. To do this we have to normalize $a$ and $b$, since otherwise we can scale $a$
and $b$ by a positive constant and make the gap in the values arbitrarily large. This
leads to the problem

\[
\begin{array}{ll}
\mbox{maximize} & t \\
\mbox{subject to} & a^T x_i - b \ge t, \quad i = 1, \ldots, N \\
& a^T y_i - b \le -t, \quad i = 1, \ldots, M \\
& \|a\|_2 \le 1,
\end{array}
\qquad (8.23)
\]

with variables $a$, $b$, and $t$. The optimal value $t^\star$ of this convex problem (with
linear objective, linear inequalities, and one quadratic inequality) is positive if
and only if the two sets of points can be linearly discriminated. In this case the
inequality $\|a\|_2 \le 1$ is always tight at the optimum, i.e., we have $\|a^\star\|_2 = 1$. (See
exercise 8.23.)

We can give a simple geometric interpretation of the robust linear discrimination
problem (8.23). If $\|a\|_2 = 1$ (as is the case at any optimal point), $a^T x_i - b$ is the
Euclidean distance from the point $x_i$ to the separating hyperplane $H = \{z \mid a^T z = b\}$.
Similarly, $b - a^T y_i$ is the distance from the point $y_i$ to the hyperplane. Therefore
the problem (8.23) finds the hyperplane that separates the two sets of points, and
has maximal distance to the sets. In other words, it finds the thickest slab that
separates the two sets.
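The slab interpretation can be illustrated in $\mathbf{R}^2$ with a crude search. For a fixed unit vector $a$, the best offset $b$ is the midpoint between the smallest $a^T x_i$ and the largest $a^T y_i$, and $t$ is half their difference. The sketch below simply tries unit normals on a grid of angles; this brute-force search is a stand-in chosen for illustration and is not the convex formulation (8.23) itself. Function names and data are assumptions.

```python
import math

def best_offset(a, X, Y):
    """For a fixed unit vector a, return the (b, t) maximizing t in problem (8.23)."""
    lo = min(a[0] * x[0] + a[1] * x[1] for x in X)   # smallest value of a^T x_i
    hi = max(a[0] * y[0] + a[1] * y[1] for y in Y)   # largest value of a^T y_i
    return (lo + hi) / 2.0, (lo - hi) / 2.0

def thickest_slab(X, Y, n_angles=360):
    """Approximate (8.23) in R^2 by trying unit normals on a grid of angles."""
    best = None
    for k in range(n_angles):
        theta = 2.0 * math.pi * k / n_angles
        a = (math.cos(theta), math.sin(theta))
        b, t = best_offset(a, X, Y)
        if best is None or t > best[2]:
            best = (a, b, t)
    return best

X = [(2.0, 0.0), (3.0, 1.0)]
Y = [(0.0, 0.0), (-1.0, 1.0)]
a, b, t = thickest_slab(X, Y)
```

Here the closest points of the two convex hulls are $(2, 0)$ and $(0, 0)$, so the search should return $a = (1, 0)$, $b = 1$, and $t = 1$, i.e., a slab of thickness 2.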

As suggested by the example shown in figure 8.9, the optimal value $t^\star$ (which is
half the slab thickness) turns out to be half the distance between the convex hulls
of the two sets of points. This can be seen clearly from the dual of the robust linear
discrimination problem (8.23). The Lagrangian (for the problem of minimizing $-t$)
is

\[ -t + \sum_{i=1}^N u_i (t + b - a^T x_i) + \sum_{i=1}^M v_i (t - b + a^T y_i) + \lambda (\|a\|_2 - 1). \]

Minimizing over $b$ and $t$ yields the conditions $\mathbf{1}^T u = 1/2$, $\mathbf{1}^T v = 1/2$. When these
hold, we have

\[
\begin{aligned}
g(u, v, \lambda)
&= \inf_a \left( a^T \left( \sum_{i=1}^M v_i y_i - \sum_{i=1}^N u_i x_i \right) + \lambda \|a\|_2 - \lambda \right) \\
&= \begin{cases}
-\lambda & \left\| \sum_{i=1}^M v_i y_i - \sum_{i=1}^N u_i x_i \right\|_2 \le \lambda \\
-\infty & \mbox{otherwise.}
\end{cases}
\end{aligned}
\]

The dual problem can then be written as

\[
\begin{array}{ll}
\mbox{maximize} & -\left\| \sum_{i=1}^M v_i y_i - \sum_{i=1}^N u_i x_i \right\|_2 \\
\mbox{subject to} & u \succeq 0, \quad \mathbf{1}^T u = 1/2 \\
& v \succeq 0, \quad \mathbf{1}^T v = 1/2.
\end{array}
\]

We can interpret $2 \sum_{i=1}^N u_i x_i$ as a point in the convex hull of $\{x_1, \ldots, x_N\}$ and
$2 \sum_{i=1}^M v_i y_i$ as a point in the convex hull of $\{y_1, \ldots, y_M\}$. The dual objective is to
minimize (half) the distance between these two points, i.e., find (half) the distance
between the convex hulls of the two sets.

Figure 8.9 By solving the robust linear discrimination problem (8.23) we
find an affine function that gives the largest gap in values between the two
sets (with a normalization bound on the linear part of the function).
Geometrically, we are finding the thickest slab that separates the two sets of
points.


Support  vector  classifier 

When  the  two  sets  of  points  cannot  be  linearly  separated,  we  might  seek  an  affine 
function  that  approximately  classifies  the  points,  for  example,  one  that  minimizes 
the  number  of  points  misclassified.  Unfortunately,  this  is  in  general  a  difficult 
combinatorial  optimization  problem.  One  heuristic  for  approximate  linear  discrim¬ 
ination  is  based  on  support  vector  classifiers ,  which  we  describe  in  this  section. 

We start with the feasibility problem (8.21). We first relax the constraints
by introducing nonnegative variables $u_1, \ldots, u_N$ and $v_1, \ldots, v_M$, and forming the
inequalities

\[ a^T x_i - b \ge 1 - u_i, \quad i = 1, \ldots, N, \qquad a^T y_i - b \le -(1 - v_i), \quad i = 1, \ldots, M. \qquad (8.24) \]

Figure 8.10 Approximate linear discrimination via linear programming. The
points $x_1, \ldots, x_{50}$, shown as open circles, cannot be linearly separated from
the points $y_1, \ldots, y_{50}$, shown as filled circles. The classifier shown as a solid
line was obtained by solving the LP (8.25). This classifier misclassifies one
point. The dashed lines are the hyperplanes $a^T z - b = \pm 1$. Four points are
correctly classified, but lie in the slab defined by the dashed lines.


When $u = v = 0$, we recover the original constraints; by making $u$ and $v$ large
enough, these inequalities can always be made feasible. We can think of $u_i$ as
a measure of how much the constraint $a^T x_i - b \ge 1$ is violated, and similarly
for $v_i$. Our goal is to find $a$, $b$, and sparse nonnegative $u$ and $v$ that satisfy the
inequalities (8.24). As a heuristic for this, we can minimize the sum of the variables
$u_i$ and $v_i$, by solving the LP

\[
\begin{array}{ll}
\mbox{minimize} & \mathbf{1}^T u + \mathbf{1}^T v \\
\mbox{subject to} & a^T x_i - b \ge 1 - u_i, \quad i = 1, \ldots, N \\
& a^T y_i - b \le -(1 - v_i), \quad i = 1, \ldots, M \\
& u \succeq 0, \quad v \succeq 0.
\end{array}
\qquad (8.25)
\]

Figure 8.10 shows an example. In this example, the affine function $a^T z - b$
misclassifies 1 out of 100 points. Note however that when $0 < u_i < 1$, the point $x_i$
is correctly classified by the affine function $a^T z - b$, but violates the inequality
$a^T x_i - b \ge 1$, and similarly for $y_i$. The objective function in the LP (8.25) can be
interpreted as a relaxation of the number of points $x_i$ that violate $a^T x_i - b \ge 1$ plus
the number of points $y_i$ that violate $a^T y_i - b \le -1$. In other words, it is a relaxation
of the number of points misclassified by the function $a^T z - b$, plus the number of
points that are correctly classified but lie in the slab defined by $-1 < a^T z - b < 1$.
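For a fixed pair $(a, b)$, the smallest $u$ and $v$ satisfying (8.24) are available in closed form, which makes the interpretation of the LP objective easy to check on data. The sketch below assumes hypothetical helper names `slacks` and `summarize`; it evaluates a given classifier and does not solve the LP (8.25).

```python
def slacks(a, b, X, Y):
    """Smallest nonnegative u, v satisfying (8.24) for a fixed function a^T z - b."""
    f = lambda z: sum(ai * zi for ai, zi in zip(a, z)) - b
    u = [max(0.0, 1.0 - f(x)) for x in X]   # violation of a^T x_i - b >= 1
    v = [max(0.0, 1.0 + f(y)) for y in Y]   # violation of a^T y_i - b <= -1
    return u, v

def summarize(u, v):
    s = u + v
    return {
        "objective": sum(s),                               # 1^T u + 1^T v
        "misclassified": sum(1 for si in s if si > 1.0),   # wrong side of the hyperplane
        "in_slab": sum(1 for si in s if 0.0 < si < 1.0),   # right side, inside the slab
    }

# Illustrative data: classifier a^T z - b with a = (1, 0), b = 0.
u, v = slacks((1.0, 0.0), 0.0,
              [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0)],
              [(-3.0, 0.0), (-0.25, 1.0)])
report = summarize(u, v)
```

A slack exactly equal to 1 corresponds to a point lying on the hyperplane itself; the counts above leave such points out of both categories.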

More generally, we can consider the trade-off between the number of misclassified
points, and the width of the slab $\{z \mid -1 \le a^T z - b \le 1\}$, which is
given by $2/\|a\|_2$. The standard support vector classifier for the sets $\{x_1, \ldots, x_N\}$,
$\{y_1, \ldots, y_M\}$ is defined as the solution of

\[
\begin{array}{ll}
\mbox{minimize} & \|a\|_2 + \gamma (\mathbf{1}^T u + \mathbf{1}^T v) \\
\mbox{subject to} & a^T x_i - b \ge 1 - u_i, \quad i = 1, \ldots, N \\
& a^T y_i - b \le -(1 - v_i), \quad i = 1, \ldots, M \\
& u \succeq 0, \quad v \succeq 0.
\end{array}
\]

The first term is proportional to the inverse of the width of the slab defined by
$-1 \le a^T z - b \le 1$. The second term has the same interpretation as above, i.e., it
is a convex relaxation for the number of misclassified points (including the points
in the slab). The parameter $\gamma$, which is positive, gives the relative weight of the
number of misclassified points (which we want to minimize), compared to the width
of the slab (which we want to maximize). Figure 8.11 shows an example.

Figure 8.11 Approximate linear discrimination via support vector classifier,
with $\gamma = 0.1$. The support vector classifier, shown as the solid line,
misclassifies three points. Fifteen points are correctly classified but lie in the slab
defined by $-1 < a^T z - b < 1$, bounded by the dashed lines.
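The role of $\gamma$ can be seen by evaluating the support vector classifier objective for two candidate classifiers. The sketch below uses the smallest feasible $u$ and $v$ for each candidate; the function name `svc_objective` and the data are illustrative assumptions, and no optimization is performed.

```python
import math

def svc_objective(a, b, X, Y, gamma):
    """Return (||a||_2 + gamma (1^T u + 1^T v), slab width 2 / ||a||_2)."""
    f = lambda z: sum(ai * zi for ai, zi in zip(a, z)) - b
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    # Smallest slacks satisfying the constraints for this (a, b).
    penalty = (sum(max(0.0, 1.0 - f(x)) for x in X)
               + sum(max(0.0, 1.0 + f(y)) for y in Y))
    return norm_a + gamma * penalty, 2.0 / norm_a

X = [(1.0, 0.0), (3.0, 0.0)]
Y = [(-1.0, 0.0), (-3.0, 0.0)]
narrow = svc_objective((1.0, 0.0), 0.0, X, Y, gamma=0.1)   # slab width 2, no violations
wide = svc_objective((0.5, 0.0), 0.0, X, Y, gamma=0.1)     # slab width 4, two points inside
```

With $\gamma = 0.1$ the wider slab has the smaller objective even though two points fall inside it; with a large value such as $\gamma = 10$ the ordering reverses.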

Approximate  linear  discrimination  via  logistic  modeling 

Another approach to finding an affine function that approximately classifies two
sets of points that cannot be linearly separated is based on the logistic model
described in §7.1.1. We start by fitting the two sets of points with a logistic model.
Suppose $z$ is a random variable with values 0 or 1, with a distribution that depends
on some (deterministic) explanatory variable $u \in \mathbf{R}^n$, via a logistic model of the
form

\[
\begin{array}{rcl}
\mathbf{prob}(z = 1) &=& \exp(a^T u - b) / (1 + \exp(a^T u - b)) \\
\mathbf{prob}(z = 0) &=& 1 / (1 + \exp(a^T u - b)).
\end{array}
\qquad (8.26)
\]

Now we assume that the given sets of points, $\{x_1, \ldots, x_N\}$ and $\{y_1, \ldots, y_M\}$,
arise as samples from the logistic model. Specifically, $\{x_1, \ldots, x_N\}$ are the values
of $u$ for the $N$ samples for which $z = 1$, and $\{y_1, \ldots, y_M\}$ are the values of $u$ for
the $M$ samples for which $z = 0$. (This allows us to have $x_i = y_j$, which would rule
out discrimination between the two sets. In a logistic model, it simply means that
we have two samples, with the same value of explanatory variable but different
outcomes.)

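We can determine $a$ and $b$ by maximum likelihood estimation from the observed
samples, by solving the convex optimization problem

\[ \mbox{minimize} \quad -l(a, b) \qquad (8.27) \]

with variables $a$, $b$, where $l$ is the log-likelihood function

\[
l(a, b) = \sum_{i=1}^N (a^T x_i - b)
- \sum_{i=1}^N \log(1 + \exp(a^T x_i - b))
- \sum_{i=1}^M \log(1 + \exp(a^T y_i - b))
\]

(see §7.1.1). If the two sets of points can be linearly separated, i.e., if there exist $a$,
$b$ with $a^T x_i > b$ and $a^T y_i < b$, then the optimal value of the optimization problem
(8.27) is not attained: $-l$ is positive everywhere, but scaling such $a$ and $b$ by a growing
positive factor drives it to zero, so no maximum likelihood estimate exists.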
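The log-likelihood is straightforward to evaluate, and doing so shows what happens with linearly separable data: scaling a separating pair $(a, b)$ by a growing positive factor keeps increasing $l$, so no finite maximizer exists. The sketch below is illustrative; `softplus` is a helper introduced here for a numerically stable $\log(1 + e^s)$, and the data are made up.

```python
import math

def softplus(s):
    """Numerically stable log(1 + exp(s))."""
    return max(s, 0.0) + math.log1p(math.exp(-abs(s)))

def log_likelihood(a, b, X, Y):
    """l(a, b) for samples X with outcome z = 1 and samples Y with outcome z = 0."""
    f = lambda u: sum(ai * ui for ai, ui in zip(a, u)) - b
    return (sum(f(x) for x in X)
            - sum(softplus(f(x)) for x in X)
            - sum(softplus(f(y)) for y in Y))

# Linearly separable illustrative data: a^T x_i > b and a^T y_i < b for a = (1, 1), b = 0.
X = [(1.0, 2.0), (2.0, 1.0)]
Y = [(-1.0, 0.0), (0.0, -2.0)]
a, b = (1.0, 1.0), 0.0
values = [log_likelihood([s * ai for ai in a], s * b, X, Y) for s in (1.0, 10.0, 100.0)]
```

Each term of $l$ is the logarithm of a probability, so every entry of `values` is nonpositive, and the entries increase toward zero as the scale factor grows.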

Once we find the maximum likelihood values of $a$ and $b$, we can form a linear
classifier $f(x) = a^T x - b$ for the two sets of points. This classifier has the following
property: Assuming the data points are in fact generated from a logistic model
with parameters $a$ and $b$, it has the smallest probability of misclassification, over
all linear classifiers. The hyperplane $a^T u = b$ corresponds to the points where
$\mathbf{prob}(z = 1) = 1/2$, i.e., the two outcomes are equally likely. An example is shown
in figure 8.12.


Figure 8.12 Approximate linear discrimination via logistic modeling. The
points $x_1, \ldots, x_{50}$, shown as open circles, cannot be linearly separated from
the points $y_1, \ldots, y_{50}$, shown as filled circles. The maximum likelihood
logistic model yields the hyperplane shown as a dark line, which misclassifies
only two points. The two dashed lines show $a^T u - b = \pm 1$, where the
probability of each outcome, according to the logistic model, is 73%. Three points
are correctly classified, but lie in between the dashed lines.

Remark 8.1 Bayesian interpretation. Let $x$ and $z$ be two random variables, taking
values in $\mathbf{R}^n$ and in $\{0, 1\}$, respectively. We assume that

\[ \mathbf{prob}(z = 1) = \mathbf{prob}(z = 0) = 1/2, \]

and we denote by $p_0(x)$ and $p_1(x)$ the conditional probability densities of $x$, given
$z = 0$ and given $z = 1$, respectively. We assume that $p_0$ and $p_1$ satisfy

\[ \frac{p_1(x)}{p_0(x)} = e^{a^T x - b} \]

for some $a$ and $b$. Many common distributions satisfy this property. For example,
$p_0$ and $p_1$ could be two normal densities on $\mathbf{R}^n$ with equal covariance matrices and
different means, or they could be two exponential densities on $\mathbf{R}^n_+$.

It follows from Bayes’ rule that

\[
\begin{array}{rcl}
\mathbf{prob}(z = 1 \mid x = u) &=& \displaystyle \frac{p_1(u)}{p_1(u) + p_0(u)} \\[2ex]
\mathbf{prob}(z = 0 \mid x = u) &=& \displaystyle \frac{p_0(u)}{p_1(u) + p_0(u)},
\end{array}
\]

from which we obtain

\[
\begin{array}{rcl}
\mathbf{prob}(z = 1 \mid x = u) &=& \displaystyle \frac{\exp(a^T u - b)}{1 + \exp(a^T u - b)} \\[2ex]
\mathbf{prob}(z = 0 \mid x = u) &=& \displaystyle \frac{1}{1 + \exp(a^T u - b)}.
\end{array}
\]

The logistic model (8.26) can therefore be interpreted as the posterior distribution of
$z$, given that $x = u$.
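The remark can be checked numerically in one dimension. For two normal densities with means $\mu_0$, $\mu_1$ and common standard deviation $\sigma$, expanding the squares gives $\log(p_1(x)/p_0(x)) = ax - b$ with $a = (\mu_1 - \mu_0)/\sigma^2$ and $b = (\mu_1^2 - \mu_0^2)/(2\sigma^2)$. The sketch below compares the posterior from Bayes' rule with the logistic formula; the parameter values and function names are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

mu0, mu1, sigma = -1.0, 2.0, 1.5            # illustrative values
a = (mu1 - mu0) / sigma ** 2
b = (mu1 ** 2 - mu0 ** 2) / (2.0 * sigma ** 2)

def posterior_from_bayes(u):
    p0, p1 = normal_pdf(u, mu0, sigma), normal_pdf(u, mu1, sigma)
    return p1 / (p1 + p0)                    # the equal priors 1/2 cancel

def posterior_from_logistic(u):
    # exp(s) / (1 + exp(s)) rewritten as 1 / (1 + exp(-s)), with s = a u - b.
    return 1.0 / (1.0 + math.exp(-(a * u - b)))

gaps = [abs(posterior_from_bayes(u) - posterior_from_logistic(u))
        for u in (-3.0, 0.0, 0.5, 4.0)]
```

The two posteriors agree up to rounding, and at the midpoint $u = 0.5$ between the two means both equal $1/2$.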


8.6.2  Nonlinear  discrimination 

We can just as well seek a nonlinear function $f$, from a given subspace of functions,
that is positive on one set and negative on another:

\[ f(x_i) > 0, \quad i = 1, \ldots, N, \qquad f(y_i) < 0, \quad i = 1, \ldots, M. \]

Provided $f$ is linear (or affine) in the parameters that define it, these inequalities
can be solved in exactly the same way as in linear discrimination. In this section
we examine some interesting special cases.

Quadratic  discrimination 

Suppose we take $f$ to be quadratic: $f(x) = x^T P x + q^T x + r$. The parameters
$P \in \mathbf{S}^n$, $q \in \mathbf{R}^n$, $r \in \mathbf{R}$ must satisfy the inequalities

\[
\begin{array}{ll}
x_i^T P x_i + q^T x_i + r > 0, & i = 1, \ldots, N \\
y_i^T P y_i + q^T y_i + r < 0, & i = 1, \ldots, M,
\end{array}
\]

which is a set of strict linear inequalities in the variables $P$, $q$, $r$. As in linear
discrimination, we note that $f$ is homogeneous in $P$, $q$, and $r$, so we can find a
solution to the strict inequalities by solving the nonstrict feasibility problem

\[
\begin{array}{ll}
x_i^T P x_i + q^T x_i + r \ge 1, & i = 1, \ldots, N \\
y_i^T P y_i + q^T y_i + r \le -1, & i = 1, \ldots, M.
\end{array}
\]

The separating surface $\{z \mid z^T P z + q^T z + r = 0\}$ is a quadratic surface, and
the two classification regions

\[ \{z \mid z^T P z + q^T z + r \le 0\}, \qquad \{z \mid z^T P z + q^T z + r \ge 0\}, \]

are defined by quadratic inequalities. Solving the quadratic discrimination problem,
then, is the same as determining whether the two sets of points can be separated
by a quadratic surface.

We can impose conditions on the shape of the separating surface or classification
regions by adding constraints on $P$, $q$, and $r$. For example, we can require that
$P \prec 0$, which means the separating surface is ellipsoidal. More specifically, it means
that we seek an ellipsoid that contains all the points $x_1, \ldots, x_N$, but none of the
points $y_1, \ldots, y_M$. This quadratic discrimination problem can be solved as an SDP
feasibility problem

\[
\begin{array}{ll}
\mbox{find} & P, \; q, \; r \\
\mbox{subject to} & x_i^T P x_i + q^T x_i + r \ge 1, \quad i = 1, \ldots, N \\
& y_i^T P y_i + q^T y_i + r \le -1, \quad i = 1, \ldots, M \\
& P \preceq -I,
\end{array}
\]

with variables $P \in \mathbf{S}^n$, $q \in \mathbf{R}^n$, and $r \in \mathbf{R}$. (Here we use homogeneity in $P$, $q$, $r$
to express the constraint $P \prec 0$ as $P \preceq -I$.) Figure 8.13 shows an example.
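The homogeneity step can be illustrated in $\mathbf{R}^2$: a triple $(P, q, r)$ with $P \prec 0$ that strictly separates the points can be multiplied by a single positive factor so that both the unit margins and $P \preceq -I$ hold. The sketch below assumes such a triple is already known; it does not solve the SDP, and all names are hypothetical.

```python
import math

def quad(P, q, r, z):
    """f(z) = z^T P z + q^T z + r for a symmetric 2x2 matrix P."""
    return (P[0][0] * z[0] ** 2 + 2.0 * P[0][1] * z[0] * z[1] + P[1][1] * z[1] ** 2
            + q[0] * z[0] + q[1] * z[1] + r)

def max_eigenvalue(P):
    """Largest eigenvalue of a symmetric 2x2 matrix (closed form)."""
    mean = 0.5 * (P[0][0] + P[1][1])
    radius = math.hypot(0.5 * (P[0][0] - P[1][1]), P[0][1])
    return mean + radius

def scale_to_sdp_form(P, q, r, X, Y):
    """Scale a strict ellipsoidal separator so that margins are >= 1 and P <= -I."""
    margin = min(min(quad(P, q, r, x) for x in X), min(-quad(P, q, r, y) for y in Y))
    lam = max_eigenvalue(P)
    if margin <= 0 or lam >= 0:
        raise ValueError("need f > 0 on X, f < 0 on Y, and P negative definite")
    s = max(1.0 / margin, 1.0 / (-lam))      # one factor that enforces both conditions
    return [[s * pij for pij in row] for row in P], [s * qi for qi in q], s * r

# Illustrative separator f(z) = 0.4 - 0.1 ||z||^2, positive inside the disc of radius 2.
P, q, r = [[-0.1, 0.0], [0.0, -0.1]], [0.0, 0.0], 0.4
X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(3.0, 0.0), (0.0, -3.0)]
Ps, qs, rs = scale_to_sdp_form(P, q, r, X, Y)
```

Scaling by a positive factor leaves the zero level set, and hence the ellipsoid, unchanged.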

Polynomial  discrimination 

We consider the set of polynomials on $\mathbf{R}^n$ with degree less than or equal to $d$:

\[ f(x) = \sum_{i_1 + \cdots + i_n \le d} a_{i_1 \cdots i_n} x_1^{i_1} \cdots x_n^{i_n}. \]

We can determine whether or not two sets $\{x_1, \ldots, x_N\}$ and $\{y_1, \ldots, y_M\}$ can be
separated by such a polynomial by solving a set of linear inequalities in the variables
$a_{i_1 \cdots i_n}$. Geometrically, we are checking whether the two sets can be separated by
an algebraic surface (defined by a polynomial of degree less than or equal to $d$).
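The reason the inequalities are linear in the coefficients is that, for a fixed point $x$, the value $f(x)$ is an inner product between the coefficient vector and the vector of monomials evaluated at $x$. The sketch below enumerates the exponent tuples and evaluates $f$ this way; the function names and the ordering of the monomials are choices made for illustration.

```python
from itertools import product

def exponents(n, d):
    """All exponent tuples (i_1, ..., i_n) with i_1 + ... + i_n <= d."""
    return [e for e in product(range(d + 1), repeat=n) if sum(e) <= d]

def monomials(x, d):
    """Values of the monomials x_1^{i_1} ... x_n^{i_n} at the point x."""
    feats = []
    for e in exponents(len(x), d):
        value = 1.0
        for xi, ei in zip(x, e):
            value *= xi ** ei
        feats.append(value)
    return feats

def poly_value(coeffs, x, d):
    """f(x) as an inner product: linear in coeffs for fixed x."""
    return sum(c * mono for c, mono in zip(coeffs, monomials(x, d)))

# Example: f(x) = 1 - x_1^2 - x_2^2 as a coefficient vector for n = 2, d = 2.
E = exponents(2, 2)
coeffs = [0.0] * len(E)
coeffs[E.index((0, 0))] = 1.0
coeffs[E.index((2, 0))] = -1.0
coeffs[E.index((0, 2))] = -1.0
```

Each data point therefore contributes one linear inequality in the coefficient vector, whose length is the number of exponent tuples (10 for $n = 2$, $d = 3$).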

As an extension, the problem of determining the minimum degree polynomial on
$\mathbf{R}^n$ that separates two sets of points can be solved via quasiconvex programming,
since the degree of a polynomial is a quasiconvex function of the coefficients. This
can be carried out by bisection on $d$, solving a feasibility linear program at each
step. An example is shown in figure 8.14.

Figure 8.13 Quadratic discrimination, with the condition that $P \prec 0$. This
means that we seek an ellipsoid containing all of $x_i$ (shown as open circles)
and none of the $y_i$ (shown as filled circles). This can be solved as an SDP
feasibility problem.

Figure 8.14 Minimum degree polynomial discrimination in $\mathbf{R}^2$. In this
example, there exists no cubic polynomial that separates the points $x_1, \ldots, x_N$
(shown as open circles) from the points $y_1, \ldots, y_M$ (shown as filled circles),
but they can be separated by a fourth-degree polynomial, the zero level set of
which is shown.


8.7  Placement  and  location

In  this  section  we  discuss  a  few  variations  on  the  following  problem.  We  have 
N  points  in  R2  or  R3,  and  a  list  of  pairs  of  points  that  must  be  connected  by 
links.  The  positions  of  some  of  the  N  points  are  fixed;  our  task  is  to  determine  the 
positions  of  the  remaining  points,  i.e.,  to  place  the  remaining  points.  The  objective 
is  to  place  the  points  so  that  some  measure  of  the  total  interconnection  length  of 
the  links  is  minimized,  subject  to  some  additional  constraints  on  the  positions. 
As  an  example  application,  we  can  think  of  the  points  as  locations  of  plants  or 
warehouses  of  a  company,  and  the  links  as  the  routes  over  which  goods  must  be 
shipped.  The  goal  is  to  find  locations  that  minimize  the  total  transportation  cost. 
In  another  application,  the  points  represent  the  position  of  modules  or  cells  on  an 
integrated  circuit,  and  the  links  represent  wires  that  connect  pairs  of  cells.  Here 
the  goal  might  be  to  place  the  cells  in  such  a  way  that  the  total  length  of  wire  used 
to  interconnect  the  cells  is  minimized. 

The problem can be described in terms of an undirected graph with $N$ nodes,
representing the $N$ points. With each node we associate a variable $x_i \in \mathbf{R}^k$, where
$k = 2$ or $k = 3$, which represents its location or position. The problem is to
minimize

\[ \sum_{(i,j) \in \mathcal{A}} f_{ij}(x_i, x_j) \]

where $\mathcal{A}$ is the set of all links in the graph, and $f_{ij} : \mathbf{R}^k \times \mathbf{R}^k \to \mathbf{R}$ is a cost
function associated with arc $(i, j)$. (Alternatively, we can sum over all $i$ and $j$, or
over $i < j$, and simply set $f_{ij} = 0$ when nodes $i$ and $j$ are not connected.) Some of
the coordinate vectors $x_i$ are given. The optimization variables are the remaining
coordinates. Provided the functions $f_{ij}$ are convex, this is a convex optimization
problem.


8.7.1  Linear  facility  location  problems 

In the simplest version of the problem the cost associated with arc $(i, j)$ is the
distance between nodes $i$ and $j$: $f_{ij}(x_i, x_j) = \|x_i - x_j\|$, i.e., we minimize

\[ \sum_{(i,j) \in \mathcal{A}} \|x_i - x_j\|. \]

We can use any norm, but the most common applications involve the Euclidean
norm or the $\ell_1$-norm. For example, in circuit design it is common to route the wires
between cells along piecewise-linear paths, with each segment either horizontal or
vertical. (This is called Manhattan routing, since paths along the streets in a city
with a rectangular grid are also piecewise-linear, with each street aligned with one
of two orthogonal axes.) In this case, the length of wire required to connect cell $i$
and cell $j$ is given by $\|x_i - x_j\|_1$.

We can include nonnegative weights that reflect differences in the cost per unit
distance along different arcs:

\[ \sum_{(i,j) \in \mathcal{A}} w_{ij} \|x_i - x_j\|. \]

By assigning a weight $w_{ij} = 0$ to pairs of nodes that are not connected, we can
express this problem more simply using the objective

\[ \sum_{i<j} w_{ij} \|x_i - x_j\|. \qquad (8.28) \]

This placement problem is convex.


Example 8.4 One free point. Consider the case where only one point $(u, v) \in \mathbf{R}^2$ is
free, and we minimize the sum of the distances to fixed points $(u_1, v_1), \ldots, (u_K, v_K)$.

• $\ell_1$-norm. We can find a point that minimizes

\[ \sum_{i=1}^K \left( |u - u_i| + |v - v_i| \right) \]

analytically. An optimal point is any median of the fixed points. In other words,
$u$ can be taken to be any median of the points $\{u_1, \ldots, u_K\}$, and $v$ can be taken
to be any median of the points $\{v_1, \ldots, v_K\}$. (If $K$ is odd, the minimizer is
unique; if $K$ is even, there can be a rectangle of optimal points.)

• Euclidean norm. The point $(u, v)$ that minimizes the sum of the Euclidean
distances,

\[ \sum_{i=1}^K \left( (u - u_i)^2 + (v - v_i)^2 \right)^{1/2}, \]

is called the Weber point of the given fixed points.
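Both cases of the example can be sketched in a few lines. The $\ell_1$ case uses coordinatewise medians directly. For the Euclidean case the text gives no algorithm; the sketch below uses Weiszfeld's fixed-point iteration, which is a technique brought in here for illustration and is only one of several ways to compute a Weber point. Function names and data are assumptions.

```python
import math
from statistics import median

def l1_point(points):
    """A minimizer of sum_i (|u - u_i| + |v - v_i|): coordinatewise medians."""
    return median(p[0] for p in points), median(p[1] for p in points)

def weber_point(points, iters=1000, tol=1e-12):
    """Weiszfeld's fixed-point iteration for the sum of Euclidean distances."""
    K = len(points)
    u = sum(p[0] for p in points) / K      # start from the centroid
    v = sum(p[1] for p in points) / K
    for _ in range(iters):
        dists = [math.hypot(u - p[0], v - p[1]) for p in points]
        if min(dists) == 0.0:
            break                          # iterate sits on a fixed point; update undefined
        wsum = sum(1.0 / d for d in dists)
        u_new = sum(p[0] / d for p, d in zip(points, dists)) / wsum
        v_new = sum(p[1] / d for p, d in zip(points, dists)) / wsum
        done = math.hypot(u_new - u, v_new - v) < tol
        u, v = u_new, v_new
        if done:
            break
    return u, v

fixed = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
p_l1 = l1_point(fixed)
p_weber = weber_point(fixed)
```

For these four points, which form a convex quadrilateral, the Weber point is the intersection of the diagonals, $(0.5, 0.5)$. The $\ell_1$ solution set is the square $[0, 1] \times [0, 1]$, and `median` happens to return its center.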


8.7.2  Placement  constraints 

We now list some interesting constraints that can be added to the basic placement
problem, preserving convexity. We can require some positions $x_i$ to lie in a specified
convex set, e.g., a particular line, interval, square, or ellipsoid. We can constrain
the relative position of one point with respect to one or more other points, for
example, by limiting the distance between a pair of points. We can impose relative
position constraints, e.g., that one point must lie to the left of another point.

The bounding box of a group of points is the smallest rectangle that contains
the points. We can impose a constraint that limits the points $x_1, \ldots, x_p$ (say) to lie
in a bounding box with perimeter not exceeding $P_{\max}$, by adding the constraints

\[ u \preceq x_i \preceq v, \quad i = 1, \ldots, p, \qquad 2 \cdot \mathbf{1}^T (v - u) \le P_{\max}, \]

where $u$, $v$ are additional variables.
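For fixed positions, the constraints above are feasible in $u$ and $v$ exactly when the tightest choice, the componentwise minimum and maximum of the points, meets the perimeter bound. The sketch below checks this for given points; the function names are hypothetical.

```python
def bounding_box(points):
    """Componentwise minimum u and maximum v of the points (tightest u <= x_i <= v)."""
    k = len(points[0])
    u = [min(p[j] for p in points) for j in range(k)]
    v = [max(p[j] for p in points) for j in range(k)]
    return u, v

def perimeter(u, v):
    """2 * 1^T (v - u); in R^2 this is the perimeter of the box."""
    return 2.0 * sum(vj - uj for uj, vj in zip(u, v))

def satisfies_box_constraint(points, p_max):
    u, v = bounding_box(points)
    return perimeter(u, v) <= p_max

pts = [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)]
u, v = bounding_box(pts)
```

In an optimization model $u$ and $v$ would be variables alongside the positions; the check here only evaluates the constraint for positions that are already known.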

8.7.3  Nonlinear  facility  location  problems

More generally, we can associate a cost with each arc that is a nonlinear increasing
function of the length, i.e.,

\[ \mbox{minimize} \quad \sum_{i<j} w_{ij} h(\|x_i - x_j\|) \]

where $h$ is an increasing (on $\mathbf{R}_+$) and convex function, and $w_{ij} \ge 0$. We call this
a nonlinear placement or nonlinear facility location problem.

One common example uses the Euclidean norm, and the function $h(z) = z^2$,
i.e., we minimize

\[ \sum_{i<j} w_{ij} \|x_i - x_j\|_2^2. \]

This is called a quadratic placement problem. The quadratic placement problem
can be solved analytically when the only constraints are linear equalities; it can be
solved as a QP if the constraints are linear equalities and inequalities.


Example 8.5 One free point. Consider the case where only one point $x$ is free, and we
minimize the sum of the squares of the Euclidean distances to fixed points $x_1, \ldots, x_K$,

\[ \|x - x_1\|_2^2 + \|x - x_2\|_2^2 + \cdots + \|x - x_K\|_2^2. \]

Taking derivatives, we see that the optimal $x$ is given by

\[ \frac{1}{K} (x_1 + x_2 + \cdots + x_K), \]

i.e., the average of the fixed points.
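A quick numerical check of this example: setting the gradient $2 \sum_{i=1}^K (x - x_i)$ to zero gives the average, and the objective at any other point exceeds its value at the average by $K$ times the squared distance to it. The names and data below are illustrative.

```python
def sum_sq_dist(x, points):
    """Sum over the fixed points of the squared Euclidean distance to x."""
    return sum(sum((xj - pj) ** 2 for xj, pj in zip(x, p)) for p in points)

def centroid(points):
    K = len(points)
    return [sum(p[j] for p in points) / K for j in range(len(points[0]))]

fixed = [(0.0, 0.0), (4.0, 0.0), (2.0, 6.0)]
c = centroid(fixed)                      # the average of the fixed points
best = sum_sq_dist(c, fixed)
shifted = sum_sq_dist((c[0] + 1.0, c[1]), fixed)   # one unit away from the average
```

With three fixed points, moving one unit away from the average raises the objective by exactly 3.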


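Some other interesting possibilities are the ‘deadzone’ function $h$ with deadzone
width $2\gamma$, defined as

\[
h(z) = \begin{cases}
0 & |z| \le \gamma \\
|z| - \gamma & |z| \ge \gamma,
\end{cases}
\]

and the ‘quadratic-linear’ function $h$, defined as

\[
h(z) = \begin{cases}
z^2 & |z| \le \gamma \\
2\gamma |z| - \gamma^2 & |z| \ge \gamma.
\end{cases}
\]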
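Both penalty functions are simple to write down in code, which also makes it easy to confirm that the two branches of each definition agree at $|z| = \gamma$. The function names are chosen for this sketch.

```python
def deadzone(z, gamma):
    """Zero for |z| <= gamma, then grows linearly: max(|z| - gamma, 0)."""
    return max(abs(z) - gamma, 0.0)

def quadratic_linear(z, gamma):
    """z^2 for |z| <= gamma, continued linearly with matching slope beyond gamma."""
    if abs(z) <= gamma:
        return z * z
    return 2.0 * gamma * abs(z) - gamma * gamma

# Sample values with gamma = 1 for the quadratic-linear function.
samples = [quadratic_linear(z, 1.0) for z in (0.5, 1.0, 3.0)]
```

At $|z| = \gamma$ the quadratic branch gives $\gamma^2$ and the linear branch gives $2\gamma^2 - \gamma^2 = \gamma^2$, so the function is continuous there; the slopes $2z$ and $2\gamma$ match as well.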

Example 8.6 We consider a placement problem in $\mathbf{R}^2$ with 6 free points, 8 fixed
points, and 27 links. Figures 8.15-8.17 show the optimal solutions for the criteria

\[ \sum_{(i,j) \in \mathcal{A}} \|x_i - x_j\|_2, \qquad \sum_{(i,j) \in \mathcal{A}} \|x_i - x_j\|_2^2, \qquad \sum_{(i,j) \in \mathcal{A}} \|x_i - x_j\|_2^4, \]

i.e., using the penalty functions $h(z) = z$, $h(z) = z^2$, and $h(z) = z^4$. The figures also
show the resulting distributions of the link lengths.

Comparing the results, we see that the linear placement concentrates the free points in
a small area, while the quadratic and fourth-order placements spread the points over
larger areas. The linear placement includes many very short links, and a few very long
ones (3 lengths under 0.2 and 2 lengths above 1.5). The quadratic penalty function
puts a higher penalty on long lengths relative to short lengths, and for lengths under
0.1, the penalty is almost negligible. As a result, the maximum length is shorter (less
than 1.4), but we also have fewer short links. The fourth-order function puts an even
higher penalty on long lengths, and has a wider interval (between zero and about
0.4) where it is negligible. As a result, the maximum length is shorter than for the
quadratic placement, but we also have more lengths close to the maximum.

Figure 8.15 Linear placement. Placement problem with 6 free points (shown
as dots), 8 fixed points (shown as squares), and 27 links. The coordinates of
the free points minimize the sum of the Euclidean lengths of the links. The
right plot is the distribution of the 27 link lengths. The dashed curve is the
(scaled) penalty function $h(z) = z$.

Figure 8.16 Quadratic placement. Placement that minimizes the sum of
squares of the Euclidean lengths of the links, for the same data as in
figure 8.15. The dashed curve is the (scaled) penalty function $h(z) = z^2$.

Figure 8.17 Fourth-order placement. Placement that minimizes the sum of
the fourth powers of the Euclidean lengths of the links. The dashed curve
is the (scaled) penalty function $h(z) = z^4$.


8.7.4  Location  problems  with  path  constraints 

Path  constraints 

A $p$-link path along the points $x_1, \ldots, x_N$ is described by a sequence of nodes,
$i_0, \ldots, i_p \in \{1, \ldots, N\}$. The length of the path is given by

\[ \|x_{i_1} - x_{i_0}\| + \|x_{i_2} - x_{i_1}\| + \cdots + \|x_{i_p} - x_{i_{p-1}}\|, \]

which is a convex function of $x_1, \ldots, x_N$, so imposing an upper bound on the length
of a path is a convex constraint. Several interesting placement problems involve
path constraints, or have an objective based on path lengths. We describe one
typical example, in which the objective is based on a maximum path length over a
set of paths.

Minimax  delay  placement 

We consider a directed acyclic graph with nodes $1, \ldots, N$, and arcs or links
represented by a set $\mathcal{A}$ of ordered pairs: $(i,j) \in \mathcal{A}$ if and only if an arc points from $i$
to $j$. We say node $i$ is a source node if no arc in $\mathcal{A}$ points to it; it is a sink node or
destination node if no arc in $\mathcal{A}$ leaves from it. We will be interested in the maximal
paths in the graph, which begin at a source node and end at a sink node.

The  arcs  of  the  graph  are  meant  to  model  some  kind  of  flow,  say  of  goods  or 
information, in a network with nodes at positions $x_1, \ldots, x_N$. The flow starts at
a source node, then moves along a path from node to node, ending at a sink or
destination node. We use the distance between successive nodes to model propagation
time, or shipment time, of the goods between nodes; the total delay or
propagation  time  of  a  path  is  (proportional  to)  the  sum  of  the  distances  between 
successive  nodes. 

Now  we  can  describe  the  minimax  delay  placement  problem.  Some  of  the  node 
locations  are  fixed,  and  the  others  are  free,  i.e.,  optimization  variables.  The  goal 
is  to  choose  the  free  node  locations  in  order  to  minimize  the  maximum  total  delay, 
for  any  path  from  a  source  node  to  a  sink  node.  Evidently  this  is  a  convex  problem, 
since  the  objective 

\[ T_{\max} = \max\{ \|x_{i_1} - x_{i_0}\| + \cdots + \|x_{i_p} - x_{i_{p-1}}\| \;\mid\; i_0, \ldots, i_p \mbox{ is a source-sink path} \} \tag{8.29} \]

is a convex function of the locations $x_1, \ldots, x_N$.

While  the  problem  of  minimizing  (8.29)  is  convex,  the  number  of  source-sink 
paths  can  be  very  large,  exponential  in  the  number  of  nodes  or  arcs.  There  is 
a useful reformulation of the problem, which avoids enumerating all source-sink
paths. 

We first explain how we can evaluate the maximum delay $T_{\max}$ far more efficiently
than by evaluating the delay for every source-sink path, and taking the
maximum. Let $\tau_k$ be the maximum total delay of any path from node $k$ to a sink
node. Clearly we have $\tau_k = 0$ when $k$ is a sink node. Consider a node $k$, which has
outgoing arcs to nodes $j_1, \ldots, j_p$. For a path starting at node $k$ and ending at a
sink node, its first arc must lead to one of the nodes $j_1, \ldots, j_p$. If such a path first
takes the arc leading to $j_i$, and then takes the longest path from there to a sink
node, the total length is

\[ \|x_{j_i} - x_k\| + \tau_{j_i}, \]

i.e., the length of the arc to $j_i$, plus the total length of the longest path from $j_i$ to
a sink node. It follows that the maximum delay of a path starting at node $k$ and
leading to a sink node satisfies

\[ \tau_k = \max\{ \|x_{j_1} - x_k\| + \tau_{j_1}, \ldots, \|x_{j_p} - x_k\| + \tau_{j_p} \}. \tag{8.30} \]

(This  is  a  simple  dynamic  programming  argument.) 

The  equations  (8.30)  give  a  recursion  for  finding  the  maximum  delay  from  any 
node:  we  start  at  the  sink  nodes  (which  have  maximum  delay  zero),  and  then 
work  backward  using  the  equations  (8.30),  until  we  reach  all  source  nodes.  The 
maximum delay over any such path is then the maximum of all the $\tau_k$, which will
occur  at  one  of  the  source  nodes.  This  dynamic  programming  recursion  shows 
how  the  maximum  delay  along  any  source-sink  path  can  be  computed  recursively, 
without  enumerating  all  the  paths.  The  number  of  arithmetic  operations  required 
for  this  recursion  is  approximately  the  number  of  links. 
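
The recursion can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `max_delay`, the dictionary of node positions, the list of arcs, and the choice of the Euclidean norm are not part of the text. Memoization ensures that each arc is examined once, in line with the operation count mentioned above.

```python
import math
from functools import lru_cache

def max_delay(positions, arcs):
    """Return (T_max, delays) for a DAG given node positions and directed arcs.

    positions: dict mapping node -> coordinate tuple (hypothetical input format)
    arcs: list of (i, j) pairs, meaning an arc points from i to j
    """
    succ = {k: [] for k in positions}
    has_incoming = set()
    for i, j in arcs:
        succ[i].append(j)
        has_incoming.add(j)

    @lru_cache(maxsize=None)
    def tau(k):
        # A sink node has no outgoing arcs, so its maximum delay is zero.
        if not succ[k]:
            return 0.0
        # Recursion (8.30): arc length plus the longest delay from the successor.
        return max(math.dist(positions[j], positions[k]) + tau(j)
                   for j in succ[k])

    delays = {k: tau(k) for k in positions}
    sources = [k for k in positions if k not in has_incoming]
    return max(delays[k] for k in sources), delays

# Small made-up example: two source-sink paths, 1-2-4 and 1-3-4.
positions = {1: (0, 0), 2: (3, 4), 3: (3, 0), 4: (6, 4)}
arcs = [(1, 2), (1, 3), (2, 4), (3, 4)]
t_max, delays = max_delay(positions, arcs)
```

For large graphs an explicit reverse topological order would avoid deep recursion; the recursive form is kept here because it mirrors (8.30) directly.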

Now  we  show  how  the  recursion  based  on  (8.30)  can  be  used  to  formulate  the 
minimax  delay  placement  problem.  We  can  express  the  problem  as 

\[ \begin{array}{ll}
\mbox{minimize}   & \max\{ \tau_k \mid k \mbox{ a source node} \} \\
\mbox{subject to} & \tau_k = 0, \quad k \mbox{ a sink node} \\
                  & \tau_k = \max\{ \|x_j - x_k\| + \tau_j \mid \mbox{there is an arc from } k \mbox{ to } j \},
\end{array} \]

with variables $\tau_1, \ldots, \tau_N$ and the free positions. This problem is not convex, but
we can express it in an equivalent form that is convex, by replacing the equality
constraints with inequalities. We introduce new variables $T_1, \ldots, T_N$, which will be
upper bounds on $\tau_1, \ldots, \tau_N$, respectively. We will take $T_k = 0$ for all sink nodes,
and in place of (8.30) we take the inequalities

\[ T_k \geq \max\{ \|x_{j_1} - x_k\| + T_{j_1}, \ldots, \|x_{j_p} - x_k\| + T_{j_p} \}. \]

If these inequalities are satisfied, then $T_k \geq \tau_k$. Now we form the problem

\[ \begin{array}{ll}
\mbox{minimize}   & \max\{ T_k \mid k \mbox{ a source node} \} \\
\mbox{subject to} & T_k = 0, \quad k \mbox{ a sink node} \\
                  & T_k \geq \max\{ \|x_j - x_k\| + T_j \mid \mbox{there is an arc from } k \mbox{ to } j \}.
\end{array} \]

This problem, with variables $T_1, \ldots, T_N$ and the free locations, is convex, and solves
the minimax delay location problem.
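
For a solver it can help to write the last constraint arc by arc. A maximum is bounded above by $T_k$ exactly when each of its arguments is, so each node constraint splits into one norm inequality per outgoing arc. The sketch below also introduces an auxiliary scalar $t$ for the objective; the symbol $t$ is our own notation, not the text's.

```latex
\begin{array}{ll}
\mbox{minimize}   & t \\
\mbox{subject to} & T_k \leq t, \quad k \mbox{ a source node} \\
                  & T_k = 0, \quad k \mbox{ a sink node} \\
                  & \|x_j - x_k\| + T_j \leq T_k, \quad (k,j) \in \mathcal{A}.
\end{array}
```

When the norm is Euclidean, each arc constraint is a second-order cone constraint, so under that assumption the problem is a second-order cone program with one cone constraint per arc.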


8.8  Floor  planning 

In  placement  problems,  the  variables  represent  the  coordinates  of  a  number  of 
points  that  are  to  be  optimally  placed.  A  floor  planning  problem  can  be  considered 
an  extension  of  a  placement  problem  in  two  ways: 

•  The  objects  to  be  placed  are  rectangles  or  boxes  aligned  with  the  axes  (as 
opposed  to  points),  and  must  not  overlap. 

•  Each  rectangle  or  box  to  be  placed  can  be  reconfigured,  within  some  limits. 
For  example  we  might  fix  the  area  of  each  rectangle,  but  not  the  length  and 
height  separately. 

The objective is usually to minimize the size (e.g., area, volume, perimeter) of the
bounding box, which is the smallest box that contains the boxes to be configured
and  placed. 

The non-overlap constraints make the general floor planning problem a complicated
combinatorial optimization problem or rectangle packing problem. However,
if  the  relative  positioning  of  the  boxes  is  specified,  several  types  of  floor  planning 
problems  can  be  formulated  as  convex  optimization  problems.  We  explore  some 
of  these  in  this  section.  We  consider  the  two-dimensional  case,  and  make  a  few 
comments  on  extensions  to  higher  dimensions  (when  they  are  not  obvious). 

We have $N$ cells or modules $C_1, \ldots, C_N$ that are to be configured and placed
in a rectangle with width $W$ and height $H$, and lower left corner at the position
$(0,0)$. The geometry and position of the $i$th cell is specified by its width $w_i$ and
height $h_i$, and the coordinates $(x_i, y_i)$ of its lower left corner. This is illustrated in
figure 8.18.

The variables in the problem are $x_i$, $y_i$, $w_i$, $h_i$ for $i = 1, \ldots, N$, and the width
$W$ and height $H$ of the bounding rectangle. In all floor planning problems, we
require that the cells lie inside the bounding rectangle, i.e.,

\[ x_i \geq 0, \quad y_i \geq 0, \quad x_i + w_i \leq W, \quad y_i + h_i \leq H, \quad i = 1, \ldots, N. \tag{8.31} \]

Figure 8.18 Floor planning problem. Non-overlapping rectangular cells are
placed in a rectangle with width $W$, height $H$, and lower left corner at $(0,0)$.
The $i$th cell is specified by its width $w_i$, height $h_i$, and the coordinates of its
lower left corner, $(x_i, y_i)$.


We also require that the cells do not overlap, except possibly on their boundaries:

\[ \mathbf{int}(C_i \cap C_j) = \emptyset \quad \mbox{for } i \neq j. \]

(It is also possible to require a positive minimum clearance between the cells.) The
non-overlap constraint $\mathbf{int}(C_i \cap C_j) = \emptyset$ holds if and only if for $i \neq j$,

\[ C_i \mbox{ is left of } C_j, \mbox{ or } C_i \mbox{ is right of } C_j, \mbox{ or } C_i \mbox{ is below } C_j, \mbox{ or } C_i \mbox{ is above } C_j. \]

These four geometric conditions correspond to the inequalities

\[ x_i + w_i \leq x_j, \mbox{ or } x_j + w_j \leq x_i, \mbox{ or } y_i + h_i \leq y_j, \mbox{ or } y_j + h_j \leq y_i, \tag{8.32} \]

at least one of which must hold for each $i \neq j$. Note the combinatorial nature of
these constraints: for each pair $i \neq j$, at least one of the four inequalities above
must hold.
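
The disjunctive test is easy to state in code. The following sketch assumes, purely for illustration, that each cell is stored as a tuple `(x, y, w, h)` holding the lower left corner, width and height; the function names are our own.

```python
from itertools import combinations

def separated(ci, cj):
    """True if at least one of the four inequalities in (8.32) holds.

    Each cell is a tuple (x, y, w, h) with (x, y) its lower left corner.
    """
    xi, yi, wi, hi = ci
    xj, yj, wj, hj = cj
    return (xi + wi <= xj      # C_i is left of C_j
            or xj + wj <= xi   # C_i is right of C_j
            or yi + hi <= yj   # C_i is below C_j
            or yj + hj <= yi)  # C_i is above C_j

def non_overlapping(cells):
    """Check the non-overlap condition for every pair of cells."""
    return all(separated(a, b) for a, b in combinations(cells, 2))
```

Checking a given floor plan this way costs N(N-1)/2 pair tests; the difficulty discussed in the text lies in optimizing over the positions, where one of the four inequalities has to be selected for every pair.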


8.8.1  Relative  positioning  constraints 

The  idea  of  relative  positioning  constraints  is  to  specify,  for  each  pair  of  cells, 
one  of  the  four  possible  relative  positioning  conditions,  i.e.,  left,  right,  above,  or 
below.  One  simple  method  to  specify  these  constraints  is  to  give  two  relations  on 
$\{1, \ldots, N\}$: $\mathcal{L}$ (meaning ‘left of’) and $\mathcal{B}$ (meaning ‘below’). We then impose the
constraint that $C_i$ is to the left of $C_j$ if $(i,j) \in \mathcal{L}$, and $C_i$ is below $C_j$ if $(i,j) \in \mathcal{B}$.
This yields the constraints

\[ x_i + w_i \leq x_j \mbox{ for } (i,j) \in \mathcal{L}, \qquad y_i + h_i \leq y_j \mbox{ for } (i,j) \in \mathcal{B}, \tag{8.33} \]

for $i, j = 1, \ldots, N$. To ensure that the relations $\mathcal{L}$ and $\mathcal{B}$ specify the relative
positioning of each pair of cells, we require that for each $(i,j)$ with $i \neq j$, one of
the following holds:

\[ (i,j) \in \mathcal{L}, \qquad (j,i) \in \mathcal{L}, \qquad (i,j) \in \mathcal{B}, \qquad (j,i) \in \mathcal{B}, \]

and that $(i,i) \notin \mathcal{L}$, $(i,i) \notin \mathcal{B}$. The inequalities (8.33) are a set of $N(N-1)/2$ linear
inequalities in the variables. These inequalities imply the non-overlap inequalities
(8.32), which are a set of $N(N-1)/2$ disjunctions of four linear inequalities.

We can assume that the relations $\mathcal{L}$ and $\mathcal{B}$ are anti-symmetric
(i.e., $(i,j) \in \mathcal{L} \Rightarrow (j,i) \notin \mathcal{L}$) and transitive
(i.e., $(i,j) \in \mathcal{L}$, $(j,k) \in \mathcal{L} \Rightarrow (i,k) \in \mathcal{L}$). (If this
were not the case, the relative positioning constraints would clearly be infeasible.)
Transitivity corresponds to the obvious condition that if cell $C_i$ is to the left of cell
$C_j$, which is to the left of cell $C_k$, then cell $C_i$ must be to the left of cell $C_k$. In
this case the inequality corresponding to $(i,k) \in \mathcal{L}$ is redundant; it is implied by
the other two. By exploiting transitivity of the relations $\mathcal{L}$ and $\mathcal{B}$ we can remove
redundant constraints, and obtain a compact set of relative positioning inequalities.

A minimal set of relative positioning constraints is conveniently described using
two directed acyclic graphs $\mathcal{H}$ and $\mathcal{V}$ (for horizontal and vertical). Both graphs have
$N$ nodes, corresponding to the $N$ cells in the floor planning problem. The graph
$\mathcal{H}$ generates the relation $\mathcal{L}$ as follows: we have $(i,j) \in \mathcal{L}$ if and only if there is
a (directed) path in $\mathcal{H}$ from $i$ to $j$. Similarly, the graph $\mathcal{V}$ generates the relation
$\mathcal{B}$: $(i,j) \in \mathcal{B}$ if and only if there is a (directed) path in $\mathcal{V}$ from $i$ to $j$. To ensure
that  a  relative  positioning  constraint  is  given  for  every  pair  of  cells,  we  require  that 
for  every  pair  of  cells,  there  is  a  directed  path  from  one  to  the  other  in  one  of  the 
graphs. 
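
The requirement that every pair of cells is ordered by one of the two graphs can be checked mechanically. The sketch below is an illustration under our own assumptions (nodes numbered 1 to n, edges given as lists of pairs, hypothetical function names): it computes the pairs joined by a directed path in each graph and then tests that every pair is covered.

```python
from itertools import combinations

def reachable_pairs(n, edges):
    """All (i, j) such that a directed path leads from i to j (nodes 1..n)."""
    succ = {i: set() for i in range(1, n + 1)}
    for i, j in edges:
        succ[i].add(j)
    pairs = set()
    for start in succ:
        # Depth-first search from the successors of `start`.
        stack, seen = list(succ[start]), set()
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(succ[node])
        pairs.update((start, j) for j in seen)
    return pairs

def every_pair_ordered(n, h_edges, v_edges):
    """True if each pair of cells is related by a path in H or in V."""
    related = reachable_pairs(n, h_edges) | reachable_pairs(n, v_edges)
    return all((i, j) in related or (j, i) in related
               for i, j in combinations(range(1, n + 1), 2))
```

For instance, with three cells, horizontal edges `[(1, 2)]` and vertical edges `[(2, 3)]` leave the pair (1, 3) unordered, so the check fails; adding the vertical edge `(1, 3)` repairs it.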

Evidently, we only need to impose the inequalities that correspond to the edges
of the graphs $\mathcal{H}$ and $\mathcal{V}$; the others follow from transitivity. We arrive at the set of
inequalities

\[ x_i + w_i \leq x_j \mbox{ for } (i,j) \in \mathcal{H}, \qquad y_i + h_i \leq y_j \mbox{ for } (i,j) \in \mathcal{V}, \tag{8.34} \]

which is a set of linear inequalities, one for each edge in $\mathcal{H}$ and $\mathcal{V}$. The set of
inequalities (8.34) is a subset of the set of inequalities (8.33), and equivalent.

In a similar way, the $4N$ inequalities (8.31) can be reduced to a minimal, equivalent
set. The constraint $x_i \geq 0$ only needs to be imposed on the left-most cells,
i.e., for $i$ that are minimal in the relation $\mathcal{L}$. These correspond to the sources in
the graph $\mathcal{H}$, i.e., those nodes that have no edges pointing to them. Similarly, the
inequalities $x_i + w_i \leq W$ only need to be imposed for the right-most cells. In the
same way the vertical bounding box inequalities can be pruned to a minimal set.
This yields the minimal equivalent set of bounding box inequalities

\[ \begin{array}{ll}
x_i \geq 0 \mbox{ for } i \ \mathcal{L} \mbox{ minimal}, & x_i + w_i \leq W \mbox{ for } i \ \mathcal{L} \mbox{ maximal}, \\
y_i \geq 0 \mbox{ for } i \ \mathcal{B} \mbox{ minimal}, & y_i + h_i \leq H \mbox{ for } i \ \mathcal{B} \mbox{ maximal}.
\end{array} \tag{8.35} \]

A simple example is shown in figure 8.19. In this example, the $\mathcal{L}$ minimal or
left-most cells are $C_1$, $C_2$, and $C_4$, and the only right-most cell is $C_5$. The minimal
set of inequalities specifying the horizontal relative positioning is given by

\[ \begin{array}{c}
x_1 \geq 0, \quad x_2 \geq 0, \quad x_4 \geq 0, \quad x_5 + w_5 \leq W, \quad x_1 + w_1 \leq x_3, \\
x_2 + w_2 \leq x_3, \quad x_3 + w_3 \leq x_5, \quad x_4 + w_4 \leq x_5.
\end{array} \]

Figure 8.19 Example illustrating the horizontal and vertical graphs $\mathcal{H}$ and
$\mathcal{V}$ that specify the relative positioning of the cells. If there is a path from
node $i$ to node $j$ in $\mathcal{H}$, then cell $i$ must be placed to the left of cell $j$. If there
is a path from node $i$ to node $j$ in $\mathcal{V}$, then cell $i$ must be placed below cell
$j$. The floorplan shown at right satisfies the relative positioning specified by
the two graphs.

The minimal set of inequalities specifying the vertical relative positioning is given by

\[ \begin{array}{c}
y_2 \geq 0, \quad y_3 \geq 0, \quad y_5 \geq 0, \quad y_4 + h_4 \leq H, \quad y_5 + h_5 \leq H, \\
y_2 + h_2 \leq y_1, \quad y_1 + h_1 \leq y_4, \quad y_3 + h_3 \leq y_4.
\end{array} \]
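
These two lists can be generated mechanically from the graphs. The sketch below is illustrative only: the edge lists are read off from the inequalities listed above (the figure itself is not reproduced here), and the function name and the string output format are our own choices.

```python
def minimal_inequalities(n, edges, pos, size, bound):
    """List the inequalities (8.34) and (8.35) for one graph, as strings.

    pos, size, bound are the symbol names to print, e.g. "x", "w", "W".
    """
    has_in = {j for _, j in edges}
    has_out = {i for i, _ in edges}
    rows = []
    for i in range(1, n + 1):
        if i not in has_in:       # source node: minimal in the relation
            rows.append(f"{pos}{i} >= 0")
    for i in range(1, n + 1):
        if i not in has_out:      # sink node: maximal in the relation
            rows.append(f"{pos}{i} + {size}{i} <= {bound}")
    for i, j in edges:            # one inequality per edge of the graph
        rows.append(f"{pos}{i} + {size}{i} <= {pos}{j}")
    return rows

# Edge lists inferred from the inequalities in the example (an assumption).
H_EDGES = [(1, 3), (2, 3), (3, 5), (4, 5)]
V_EDGES = [(2, 1), (1, 4), (3, 4)]

horizontal = minimal_inequalities(5, H_EDGES, "x", "w", "W")
vertical = minimal_inequalities(5, V_EDGES, "y", "h", "H")
```

Note that cell 5 has no edges in the vertical graph, so it is both minimal and maximal there, which is why both `y5 >= 0` and `y5 + h5 <= H` appear.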


8.8.2  Floor  planning  via  convex  optimization 

In this formulation, the variables are the bounding box width and height $W$ and
$H$, and the cell widths, heights, and positions: $w_i$, $h_i$, $x_i$, and $y_i$, for $i = 1, \ldots, N$.
We impose the bounding box constraints (8.35) and the relative positioning
constraints (8.34), which are linear inequalities. As objective, we take the perimeter
of  the  bounding  box,  i.e.,  2(W  +  H),  which  is  a  linear  function  of  the  variables. 
We  now  list  some  of  the  constraints  that  can  be  expressed  as  convex  inequalities 
or  linear  equalities  in  the  variables. 

Minimum  spacing 

We can impose a minimum spacing $\rho > 0$ between cells by changing the relative
position constraints from $x_i + w_i \leq x_j$ for $(i,j) \in \mathcal{H}$, to $x_i + w_i + \rho \leq x_j$ for
$(i,j) \in \mathcal{H}$, and similarly for the vertical graph. We can have a different minimum
spacing associated with each edge in $\mathcal{H}$ and $\mathcal{V}$. Another possibility is to fix $W$ and
$H$, and maximize the minimum spacing $\rho$ as objective.


Minimum  cell  area 

For each cell we specify a minimum area, i.e., we require that $w_i h_i \geq A_i$, where
$A_i > 0$. These minimum cell area constraints can be expressed as convex inequalities
in several ways, e.g., $w_i \geq A_i/h_i$, $(w_i h_i)^{1/2} \geq A_i^{1/2}$, or $\log w_i + \log h_i \geq \log A_i$.

Aspect  ratio  constraints 

We can impose upper and lower bounds on the aspect ratio of each cell, i.e.,

\[ l_i \leq h_i/w_i \leq u_i. \]

Multiplying through by $w_i$ transforms these constraints into linear inequalities. We
can also fix the aspect ratio of a cell, which results in a linear equality constraint.
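
Written out, with the lower and upper bounds denoted $l_i$ and $u_i$ and a fixed ratio denoted $\alpha_i$ (the symbol $\alpha_i$ is our own, introduced only for this illustration), the linear forms are:

```latex
l_i w_i \leq h_i \leq u_i w_i,
\qquad\qquad
h_i = \alpha_i w_i \quad \mbox{(fixed aspect ratio)}.
```

The direction of the inequalities is preserved because the widths $w_i$ are positive.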

Alignment  constraints 

We can impose the constraint that two edges, or a center line, of two cells are
aligned. For example, the horizontal center line of cell $i$ aligns with the top of cell
$j$ when

\[ y_i + h_i/2 = y_j + h_j. \]

These are linear equality constraints. In a similar way we can require that a cell is
flush against the bounding box boundary.

Symmetry  constraints 

We  can  require  pairs  of  cells  to  be  symmetric  about  a  vertical  or  horizontal  axis, 
that  can  be  fixed  or  floating  (i.e.,  whose  position  is  fixed  or  not).  For  example,  to 
specify that the pair of cells $i$ and $j$ are symmetric about the vertical axis $x = x_{\rm axis}$,
we impose the linear equality constraint

\[ x_{\rm axis} - (x_i + w_i/2) = x_j + w_j/2 - x_{\rm axis}. \]

We can require that several pairs of cells be symmetric about an unspecified vertical
axis by imposing these equality constraints, and introducing $x_{\rm axis}$ as a new variable.

Similarity  constraints 

We can require that cell $i$ be an $a$-scaled translate of cell $j$ by the equality
constraints $w_i = a w_j$, $h_i = a h_j$. Here the scaling factor $a$ must be fixed. By imposing
only  one  of  these  constraints,  we  require  that  the  width  (or  height)  of  one  cell  be 
a  given  factor  times  the  width  (or  height)  of  the  other  cell. 

Containment  constraints 

We can require that a particular cell contains a given point, which imposes two linear
inequalities. We can require that a particular cell lie inside a given polyhedron,
again by imposing linear inequalities.

Distance constraints

We can impose a variety of constraints that limit the distance between pairs of
cells. In the simplest case, we can limit the distance between the center points
of cell $i$ and $j$ (or any other fixed points on the cells, such as lower left corners).
For example, to limit the distance between the centers of cells $i$ and $j$, we use the
(convex) inequality

\[ \|(x_i + w_i/2, \; y_i + h_i/2) - (x_j + w_j/2, \; y_j + h_j/2)\| \leq D_{ij}. \]

As in placement problems, we can limit sums of distances, or use sums of distances
as the objective.

We can also limit the distance $\mathbf{dist}(C_i, C_j)$ between cell $i$ and cell $j$, i.e., the
minimum distance between a point in cell $i$ and a point in cell $j$. In the general
case this can be done as follows. To limit the distance between cells $i$ and $j$ in the
norm $\|\cdot\|$, we can introduce four new variables $u_i$, $v_i$, $u_j$, $v_j$. The pair $(u_i, v_i)$
will represent a point in $C_i$, and the pair $(u_j, v_j)$ will represent a point in $C_j$. To
ensure this we impose the linear inequalities

\[ x_i \leq u_i \leq x_i + w_i, \qquad y_i \leq v_i \leq y_i + h_i, \]

and similarly for cell $j$. Finally, to limit $\mathbf{dist}(C_i, C_j)$, we add the convex inequality

\[ \|(u_i, v_i) - (u_j, v_j)\| \leq D_{ij}. \]

In many specific cases we can express these distance constraints more efficiently,
by exploiting the relative positioning constraints or deriving a more explicit
formulation. As an example consider the $\ell_\infty$-norm, and suppose cell $i$ lies to the left of
cell $j$ (by a relative positioning constraint). The horizontal displacement between
the two cells is $x_j - (x_i + w_i)$. Then we have $\mathbf{dist}(C_i, C_j) \leq D_{ij}$ if and only if

\[ x_j - (x_i + w_i) \leq D_{ij}, \qquad y_j - (y_i + h_i) \leq D_{ij}, \qquad y_i - (y_j + h_j) \leq D_{ij}. \]

The first inequality states that the horizontal displacement between the right edge
of cell $i$ and the left edge of cell $j$ does not exceed $D_{ij}$. The second inequality
requires that the bottom of cell $j$ is no more than $D_{ij}$ above the top of cell $i$, and
the third inequality requires that the bottom of cell $i$ is no more than $D_{ij}$ above the
top of cell $j$. These three inequalities together are equivalent to $\mathbf{dist}(C_i, C_j) \leq D_{ij}$.
In this case, we do not need to introduce any new variables.
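
The equivalence can be checked numerically on fixed rectangles. In the sketch below (an illustration with hypothetical function names, cells stored as tuples `(x, y, w, h)`), `linf_distance` computes the exact distance from the horizontal and vertical gaps, and `within_linf` evaluates the three inequalities, assuming as in the text that cell i lies to the left of cell j.

```python
def linf_distance(ci, cj):
    """Exact l-infinity distance between two axis-aligned rectangles."""
    xi, yi, wi, hi = ci
    xj, yj, wj, hj = cj
    # Gap between the two x-intervals and between the two y-intervals.
    gap_x = max(xj - (xi + wi), xi - (xj + wj), 0.0)
    gap_y = max(yj - (yi + hi), yi - (yj + hj), 0.0)
    # The coordinates can be chosen independently, so the distance is the
    # larger of the two gaps.
    return max(gap_x, gap_y)

def within_linf(ci, cj, d):
    """The three inequalities from the text (cell i left of cell j)."""
    xi, yi, wi, hi = ci
    xj, yj, wj, hj = cj
    return (xj - (xi + wi) <= d
            and yj - (yi + hi) <= d
            and yi - (yj + hj) <= d)
```

For rectangles that respect the relative positioning constraint, `within_linf(ci, cj, d)` agrees with `linf_distance(ci, cj) <= d` for any nonnegative `d`.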

We can limit the $\ell_1$- (or $\ell_2$-) distance between two cells in a similar way. Here
we introduce one new variable $d_v$, which will serve as a bound on the vertical
displacement between the cells. To limit the $\ell_1$-distance, we add the constraints

\[ y_j - (y_i + h_i) \leq d_v, \qquad y_i - (y_j + h_j) \leq d_v, \qquad d_v \geq 0 \]

and the constraints

\[ x_j - (x_i + w_i) + d_v \leq D_{ij}. \]

(The first term is the horizontal displacement and the second is an upper bound
on the vertical displacement.) To limit the Euclidean distance between the cells,
we replace this last constraint with

\[ (x_j - (x_i + w_i))^2 + d_v^2 \leq D_{ij}^2. \]

Figure 8.20 Four instances of an optimal floor plan, using the relative
positioning constraints shown in figure 8.19. In each case the objective is to
minimize the perimeter, and the same minimum spacing constraint between
cells is imposed. We also require the aspect ratios to lie between 1/5 and 5.
The four cases differ in the minimum areas required for each cell. The sum
of the minimum areas is the same for each case.


Example 8.7 Figure 8.20 shows an example with 5 cells, using the ordering constraints
of figure 8.19, and four different sets of constraints. In each case we impose the
same minimum required spacing constraint, and the same aspect ratio constraint
$1/5 \leq w_i/h_i \leq 5$. The four cases differ in the minimum required cell areas $A_i$. The
values of $A_i$ are chosen so that the total minimum required area $\sum_{i=1}^5 A_i$ is the same
for each case.


8.8.3  Floor  planning  via  geometric  programming 

The  floor  planning  problem  can  also  be  formulated  as  a  geometric  program  in  the 
variables $x_i$, $y_i$, $w_i$, $h_i$, $W$, $H$. The objectives and constraints that can be handled
in  this  formulation  are  a  bit  different  from  those  that  can  be  expressed  in  the  convex 
formulation. 

First we note that the bounding box constraints (8.35) and the relative
positioning constraints (8.34) are posynomial inequalities, since the lefthand sides
are sums of variables, and the righthand sides are single variables, hence monomials.
Dividing these inequalities by the righthand side yields standard posynomial
inequalities.
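
As a brief illustration of this division step, a horizontal relative positioning constraint and a right-edge bounding box constraint take the following standard posynomial form (the vertical constraints are analogous):

```latex
x_i x_j^{-1} + w_i x_j^{-1} \leq 1, \quad (i,j) \in \mathcal{H},
\qquad\qquad
x_i W^{-1} + w_i W^{-1} \leq 1, \quad i \ \mathcal{L} \mbox{ maximal}.
```

Since the variables of a geometric program are positive, the nonnegativity constraints on $x_i$ and $y_i$ in (8.35) hold automatically and need not be imposed separately.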

In the geometric programming formulation we can minimize the bounding box
area, since $WH$ is a monomial, hence posynomial. We can also exactly specify
the area of each cell, since $w_i h_i = A_i$ is a monomial equality constraint. On the
other hand alignment, symmetry, and distance constraints cannot be handled in
the geometric programming formulation. Similarity, however, can be; indeed it
is possible to require that one cell be similar to another, without specifying the
scaling ratio (which can be treated as just another variable).

Exercises

Projection  on  a  set 

8.1 Uniqueness of projection. Show that if $C \subseteq \mathbf{R}^n$ is nonempty, closed and convex, and the
norm $\|\cdot\|$ is strictly convex, then for every $x_0$ there is exactly one $x \in C$ closest to $x_0$. In
other words the projection of $x_0$ on $C$ is unique.

8.2 [Web94, Val64] Chebyshev characterization of convexity. A set $C \subseteq \mathbf{R}^n$ is called a Chebyshev
set if for every $x_0 \in \mathbf{R}^n$, there is a unique point in $C$ closest (in Euclidean norm)
to $x_0$. From the result in exercise 8.1, every nonempty, closed, convex set is a Chebyshev
set. In this problem we show the converse, which is known as Motzkin’s theorem.

Let $C \subseteq \mathbf{R}^n$ be a Chebyshev set.

(a) Show that $C$ is nonempty and closed.

(b) Show that $P_C$, the Euclidean projection on $C$, is continuous.

(c) Suppose $x_0 \notin C$. Show that $P_C(x) = P_C(x_0)$ for all $x = \theta x_0 + (1 - \theta) P_C(x_0)$ with
$0 \leq \theta \leq 1$.

(d) Suppose $x_0 \notin C$. Show that $P_C(x) = P_C(x_0)$ for all $x = \theta x_0 + (1 - \theta) P_C(x_0)$ with
$\theta \geq 1$.

(e) Combining parts (c) and (d), we can conclude that all points on the ray with base
$P_C(x_0)$ and direction $x_0 - P_C(x_0)$ have projection $P_C(x_0)$. Show that this implies
that $C$ is convex.

8.3 Euclidean projection on proper cones.

(a) Nonnegative orthant. Show that Euclidean projection onto the nonnegative orthant
is given by the expression on page 399.

(b) Positive semidefinite cone. Show that Euclidean projection onto the positive semidefinite
cone is given by the expression on page 399.

(c) Second-order cone. Show that the Euclidean projection of $(x_0, t_0)$ on the second-order
cone

\[ K = \{ (x,t) \in \mathbf{R}^{n+1} \mid \|x\|_2 \leq t \} \]

is given by

\[ P_K(x_0, t_0) = \left\{ \begin{array}{ll}
0 & \|x_0\|_2 \leq -t_0 \\
(x_0, t_0) & \|x_0\|_2 \leq t_0 \\
(1/2)(1 + t_0/\|x_0\|_2)(x_0, \|x_0\|_2) & \|x_0\|_2 \geq |t_0|.
\end{array} \right. \]

8.4 The Euclidean projection of a point on a convex set yields a simple separating hyperplane

\[ (P_C(x_0) - x_0)^T (x - (1/2)(x_0 + P_C(x_0))) = 0. \]

Find a counterexample that shows that this construction does not work for general norms.

8.5 [HUL93, volume 1, page 154] Depth function and signed distance to boundary. Let $C \subseteq \mathbf{R}^n$
be a nonempty convex set, and let $\mathbf{dist}(x, C)$ be the distance of $x$ to $C$ in some norm.
We already know that $\mathbf{dist}(x, C)$ is a convex function of $x$.

(a) Show that the depth function,

\[ \mathbf{depth}(x, C) = \mathbf{dist}(x, \mathbf{R}^n \setminus C), \]

is concave for $x \in C$.

(b) The signed distance to the boundary of $C$ is defined as

\[ s(x) = \left\{ \begin{array}{ll}
\mathbf{dist}(x, C) & x \notin C \\
-\mathbf{depth}(x, C) & x \in C.
\end{array} \right. \]

Thus, $s(x)$ is positive outside $C$, zero on its boundary, and negative on its interior.
Show that $s$ is a convex function.

Distance between sets

8.6 Let $C$, $D$ be convex sets.

(a) Show that $\mathbf{dist}(C, x + D)$ is a convex function of $x$.

(b) Show that $\mathbf{dist}(tC, x + tD)$ is a convex function of $(x, t)$ for $t > 0$.

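8.7 Separation of ellipsoids. Let $\mathcal{E}_1$ and $\mathcal{E}_2$ be two ellipsoids defined as

\[ \mathcal{E}_1 = \{ x \mid (x - x_1)^T P_1^{-1} (x - x_1) \leq 1 \}, \qquad \mathcal{E}_2 = \{ x \mid (x - x_2)^T P_2^{-1} (x - x_2) \leq 1 \}, \]

where $P_1, P_2 \in \mathbf{S}^n_{++}$. Show that $\mathcal{E}_1 \cap \mathcal{E}_2 = \emptyset$ if and only if there exists an $a \in \mathbf{R}^n$ with

\[ \|P_2^{1/2} a\|_2 + \|P_1^{1/2} a\|_2 < a^T (x_1 - x_2). \]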

8.8 Intersection and containment of polyhedra. Let $\mathcal{P}_1$ and $\mathcal{P}_2$ be two polyhedra defined as

\[ \mathcal{P}_1 = \{ x \mid Ax \preceq b \}, \qquad \mathcal{P}_2 = \{ x \mid Fx \preceq g \}, \]

with $A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$, $F \in \mathbf{R}^{p \times n}$, $g \in \mathbf{R}^p$. Formulate each of the following problems
as an LP feasibility problem, or a set of LP feasibility problems.

(a) Find a point in the intersection $\mathcal{P}_1 \cap \mathcal{P}_2$.

(b) Determine whether $\mathcal{P}_1 \subseteq \mathcal{P}_2$.

For each problem, derive a set of linear inequalities and equalities that forms a strong
alternative, and give a geometric interpretation of the alternative.

Repeat the question for two polyhedra defined as

\[ \mathcal{P}_1 = \mathbf{conv}\{v_1, \ldots, v_K\}, \qquad \mathcal{P}_2 = \mathbf{conv}\{w_1, \ldots, w_L\}. \]


Euclidean  distance  and  angle  problems 


8.9 Closest Euclidean distance matrix to given data. We are given data $\hat d_{ij}$, for $i, j = 1, \ldots, n$,
which are corrupted measurements of the Euclidean distances between vectors in $\mathbf{R}^k$:

\[ \hat d_{ij} = \|x_i - x_j\|_2 + v_{ij}, \quad i, j = 1, \ldots, n, \]

where $v_{ij}$ is some noise or error. These data satisfy $\hat d_{ij} \geq 0$ and $\hat d_{ij} = \hat d_{ji}$, for all $i$, $j$. The
dimension $k$ is not specified.

Show how to solve the following problem using convex optimization. Find a dimension
$k$ and $x_1, \ldots, x_n \in \mathbf{R}^k$ so that $\sum_{i,j=1}^n (d_{ij} - \hat d_{ij})^2$ is minimized, where $d_{ij} = \|x_i - x_j\|_2$,
$i, j = 1, \ldots, n$. In other words, given some data that are approximate Euclidean distances,
you are to find the closest set of actual Euclidean distances, in the least-squares sense.

8.10 Minimax angle fitting. Suppose that $y_1, \ldots, y_m \in \mathbf{R}^k$ are affine functions of a variable
$x \in \mathbf{R}^n$:

\[ y_i = A_i x + b_i, \quad i = 1, \ldots, m, \]

and $z_1, \ldots, z_m \in \mathbf{R}^k$ are given nonzero vectors. We want to choose the variable $x$, subject
to some convex constraints (e.g., linear inequalities), to minimize the maximum angle
between $y_i$ and $z_i$,

\[ \max\{ \angle(y_1, z_1), \ldots, \angle(y_m, z_m) \}. \]

The angle between nonzero vectors is defined as usual:

\[ \angle(u, v) = \cos^{-1}\left( \frac{u^T v}{\|u\|_2 \|v\|_2} \right), \]

where we take $\cos^{-1}(a) \in [0, \pi]$. We are only interested in the case when the optimal
objective value does not exceed $\pi/2$.

Formulate this problem as a convex or quasiconvex optimization problem. When the
constraints on $x$ are linear inequalities, what kind of problem (or problems) do you have
to solve?

8.11 Smallest Euclidean cone containing given points. In $\mathbf{R}^n$, we define a Euclidean cone, with
center direction $c \neq 0$, and angular radius $\theta$, with $0 \leq \theta \leq \pi/2$, as the set

\[ \{ x \in \mathbf{R}^n \mid \angle(c, x) \leq \theta \}. \]

(A Euclidean cone is a second-order cone, i.e., it can be represented as the image of the
second-order cone under a nonsingular linear mapping.)

Let $a_1, \ldots, a_m \in \mathbf{R}^n$. How would you find the Euclidean cone, of smallest angular radius,
that contains $a_1, \ldots, a_m$? (In particular, you should explain how to solve the feasibility
problem, i.e., how to determine whether there is a Euclidean cone which contains the
points.)

Extremal  volume  ellipsoids 

8.12 Show that the maximum volume ellipsoid enclosed in a set is unique. Show that the
Löwner-John ellipsoid of a set is unique.

8.13 Löwner-John ellipsoid of a simplex. In this exercise we show that the Löwner-John
ellipsoid of a simplex in $\mathbf{R}^n$ must be shrunk by a factor $n$ to fit inside the simplex. Since
the Löwner-John ellipsoid is affinely invariant, it is sufficient to show the result for one
particular simplex.

Derive the Löwner-John ellipsoid $\mathcal{E}_{\rm lj}$ for the simplex $C = \mathbf{conv}\{0, e_1, \ldots, e_n\}$. Show that
$\mathcal{E}_{\rm lj}$ must be shrunk by a factor $1/n$ to fit inside the simplex.

8.14 Efficiency of ellipsoidal inner approximation. Let $C$ be a polyhedron in $\mathbf{R}^n$ described as
$C = \{ x \mid Ax \preceq b \}$, and suppose that $\{ x \mid Ax \prec b \}$ is nonempty.

(a) Show that the maximum volume ellipsoid enclosed in $C$, expanded by a factor $n$
about its center, is an ellipsoid that contains $C$.

(b) Show that if $C$ is symmetric about the origin, i.e., of the form $C = \{ x \mid -\mathbf{1} \preceq Ax \preceq \mathbf{1} \}$,
then expanding the maximum volume inscribed ellipsoid by a factor $\sqrt{n}$ gives
an ellipsoid that contains $C$.

8.15 Minimum volume ellipsoid covering union of ellipsoids. Formulate the following problem
as a convex optimization problem. Find the minimum volume ellipsoid
$\mathcal{E} = \{ x \mid (x - x_0)^T A^{-1} (x - x_0) \leq 1 \}$ that contains $K$ given ellipsoids

\[ \mathcal{E}_i = \{ x \mid x^T A_i x + 2 b_i^T x + c_i \leq 0 \}, \quad i = 1, \ldots, K. \]

Hint. See appendix B.

8.16 Maximum volume rectangle inside a polyhedron. Formulate the following problem as a
convex optimization problem. Find the rectangle

$$\mathcal{R} = \{x \in \mathbf{R}^n \mid l \preceq x \preceq u\}$$

of maximum volume, enclosed in a polyhedron $\mathcal{P} = \{x \mid Ax \preceq b\}$. The variables are
$l, u \in \mathbf{R}^n$. Your formulation should not involve an exponential number of constraints.

Centering 

8.17  Affine  invariance  of  analytic  center.  Show  that  the  analytic  center  of  a  set  of  inequalities  is 
affine  invariant.  Show  that  it  is  invariant  with  respect  to  positive  scaling  of  the  inequalities. 

8.18  Analytic  center  and  redundant  inequalities.  Two  sets  of  linear  inequalities  that  describe 
the  same  polyhedron  can  have  different  analytic  centers.  Show  that  by  adding  redundant 
inequalities, we can make any interior point $x_0$ of a polyhedron

$$\mathcal{P} = \{x \in \mathbf{R}^n \mid Ax \preceq b\}$$

the analytic center. More specifically, suppose $A \in \mathbf{R}^{m \times n}$ and $Ax_0 \prec b$. Show that there
exist $c \in \mathbf{R}^n$, $\gamma \in \mathbf{R}$, and a positive integer $q$, such that $\mathcal{P}$ is the solution set of the
$m + q$ inequalities

$$Ax \preceq b, \quad c^T x \le \gamma, \quad c^T x \le \gamma, \quad \ldots, \quad c^T x \le \gamma \qquad (8.36)$$

(where the inequality $c^T x \le \gamma$ is added $q$ times), and $x_0$ is the analytic center of (8.36).

8.19 Let $x_{\mathrm{ac}}$ be the analytic center of a set of linear inequalities

$$a_i^T x \le b_i, \quad i = 1, \ldots, m,$$

and define $H$ as the Hessian of the logarithmic barrier function at $x_{\mathrm{ac}}$:

$$H = \sum_{i=1}^m \frac{1}{(b_i - a_i^T x_{\mathrm{ac}})^2} a_i a_i^T.$$

Show that the $k$th inequality is redundant (i.e., it can be deleted without changing the
feasible set) if

$$b_k - a_k^T x_{\mathrm{ac}} \ge m (a_k^T H^{-1} a_k)^{1/2}.$$

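8.20 Ellipsoidal approximation from analytic center of linear matrix inequality. Let $C$ be the
solution set of the LMI

$$x_1 A_1 + x_2 A_2 + \cdots + x_n A_n \preceq B,$$

where $A_i, B \in \mathbf{S}^m$, and let $x_{\mathrm{ac}}$ be its analytic center. Show that

$$\mathcal{E}_{\mathrm{inner}} \subseteq C \subseteq \mathcal{E}_{\mathrm{outer}},$$

where

$$\mathcal{E}_{\mathrm{inner}} = \{x \mid (x - x_{\mathrm{ac}})^T H (x - x_{\mathrm{ac}}) \le 1\},$$

$$\mathcal{E}_{\mathrm{outer}} = \{x \mid (x - x_{\mathrm{ac}})^T H (x - x_{\mathrm{ac}}) \le m(m - 1)\},$$

and $H$ is the Hessian of the logarithmic barrier function

$$-\log\det(B - x_1 A_1 - x_2 A_2 - \cdots - x_n A_n)$$

evaluated at $x_{\mathrm{ac}}$.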

8.21 [BYT99] Maximum likelihood interpretation of analytic center. We use the linear
measurement model of page 352,

$$y = Ax + v,$$

where $A \in \mathbf{R}^{m \times n}$. We assume the noise components $v_i$ are IID with support $[-1, 1]$. The
set of parameters $x$ consistent with the measurements $y \in \mathbf{R}^m$ is the polyhedron defined
by the linear inequalities

$$-\mathbf{1} + y \preceq Ax \preceq \mathbf{1} + y. \qquad (8.37)$$

Suppose the probability density function of $v_i$ has the form

$$p(v) = \begin{cases} \alpha_r (1 - v^2)^r & -1 \le v \le 1 \\ 0 & \text{otherwise,} \end{cases}$$

where $r \ge 1$ and $\alpha_r > 0$. Show that the maximum likelihood estimate of $x$ is the analytic
center of (8.37).

8.22 Center of gravity. The center of gravity of a set $C \subseteq \mathbf{R}^n$ with nonempty interior is defined
as

$$x_{\mathrm{cg}} = \frac{\int_C u \, du}{\int_C 1 \, du}.$$

The center of gravity is affine invariant, and (clearly) a function of the set $C$, and not
its particular description. Unlike the centers described in the chapter, however, it is very
difficult to compute the center of gravity, except in simple cases (e.g., ellipsoids, balls,
simplexes).

Show that the center of gravity $x_{\mathrm{cg}}$ is the minimizer of the convex function

$$f(x) = \int_C \|u - x\|_2^2 \, du.$$


Classification 

8.23  Robust  linear  discrimination.  Consider  the  robust  linear  discrimination  problem  given 
in  (8.23). 

(a) Show that the optimal value $t^\star$ is positive if and only if the two sets of points can
be linearly separated. When the two sets of points can be linearly separated, show
that the inequality $\|a\|_2 \le 1$ is tight, i.e., we have $\|a^\star\|_2 = 1$, for the optimal $a^\star$.

(b) Using the change of variables $\tilde{a} = a/t$, $\tilde{b} = b/t$, prove that the problem (8.23) is
equivalent to the QP

$$\begin{array}{ll} \text{minimize} & \|\tilde{a}\|_2 \\ \text{subject to} & \tilde{a}^T x_i - \tilde{b} \ge 1, \quad i = 1, \ldots, N \\ & \tilde{a}^T y_i - \tilde{b} \le -1, \quad i = 1, \ldots, M. \end{array}$$

8.24 Linear discrimination maximally robust to weight errors. Suppose we are given two sets of
points $\{x_1, \ldots, x_N\}$ and $\{y_1, \ldots, y_M\}$ in $\mathbf{R}^n$ that can be linearly separated. In §8.6.1
we showed how to find the affine function that discriminates the sets, and gives the largest
gap in function values. We can also consider robustness with respect to changes in the
vector $a$, which is sometimes called the weight vector. For a given $a$ and $b$ for which
$f(x) = a^T x - b$ separates the two sets, we define the weight error margin as the norm of
the smallest $u \in \mathbf{R}^n$ such that the affine function $(a + u)^T x - b$ no longer separates the
two sets of points. In other words, the weight error margin is the maximum $\rho$ such that

$$(a + u)^T x_i \ge b, \quad i = 1, \ldots, N, \qquad (a + u)^T y_j \le b, \quad j = 1, \ldots, M,$$

holds for all $u$ with $\|u\|_2 \le \rho$.

Show how to find $a$ and $b$ that maximize the weight error margin, subject to the
normalization constraint $\|a\|_2 \le 1$.

8.25 Most spherical separating ellipsoid. We are given two sets of vectors $x_1, \ldots, x_N \in \mathbf{R}^n$, and
$y_1, \ldots, y_M \in \mathbf{R}^n$, and wish to find the ellipsoid with minimum eccentricity (i.e., minimum
condition number of the defining matrix) that contains the points $x_1, \ldots, x_N$, but not the
points $y_1, \ldots, y_M$. Formulate this as a convex optimization problem.

Placement  and  floor  planning 

8.26 Quadratic placement. We consider a placement problem in $\mathbf{R}^2$, defined by an undirected
graph $\mathcal{A}$ with $N$ nodes, and with quadratic costs:

$$\text{minimize} \quad \sum_{(i,j) \in \mathcal{A}} \|x_i - x_j\|_2^2.$$

The variables are the positions $x_i \in \mathbf{R}^2$, $i = 1, \ldots, M$. The positions $x_i$, $i = M + 1, \ldots, N$
are given. We define two vectors $u, v \in \mathbf{R}^M$ by

$$u = (x_{11}, x_{21}, \ldots, x_{M1}), \qquad v = (x_{12}, x_{22}, \ldots, x_{M2}),$$

containing the first and second components, respectively, of the free nodes.
Show that $u$ and $v$ can be found by solving two sets of linear equations,

$$Cu = d_1, \qquad Cv = d_2,$$

where $C \in \mathbf{S}^M$. Give a simple expression for the coefficients of $C$ in terms of the graph $\mathcal{A}$.

8.27 Problems with minimum distance constraints. We consider a problem with variables
$x_1, \ldots, x_N \in \mathbf{R}^k$. The objective, $f_0(x_1, \ldots, x_N)$, is convex, and the constraints

$$f_i(x_1, \ldots, x_N) \le 0, \quad i = 1, \ldots, m,$$

are convex (i.e., the functions $f_i : \mathbf{R}^{Nk} \to \mathbf{R}$ are convex). In addition, we have the
minimum distance constraints

$$\|x_i - x_j\|_2 \ge D_{\min}, \quad i \ne j, \quad i, j = 1, \ldots, N.$$

In general, this is a hard nonconvex problem.

Following the approach taken in floorplanning, we can form a convex restriction of the
problem, i.e., a problem which is convex, but has a smaller feasible set. (Solving the
restricted problem is therefore easy, and any solution is guaranteed to be feasible for the
nonconvex problem.) Let $a_{ij} \in \mathbf{R}^k$, for $i < j$, $i, j = 1, \ldots, N$, satisfy $\|a_{ij}\|_2 = 1$.

Show that the restricted problem

$$\begin{array}{ll} \text{minimize} & f_0(x_1, \ldots, x_N) \\ \text{subject to} & f_i(x_1, \ldots, x_N) \le 0, \quad i = 1, \ldots, m \\ & a_{ij}^T (x_i - x_j) \ge D_{\min}, \quad i < j, \quad i, j = 1, \ldots, N \end{array}$$

is convex, and that every feasible point satisfies the minimum distance constraint.
Remark. There are many good heuristics for choosing the directions $a_{ij}$. One simple
one starts with an approximate solution $\hat{x}_1, \ldots, \hat{x}_N$ (that need not satisfy the minimum
distance constraints). We then set $a_{ij} = (\hat{x}_i - \hat{x}_j) / \|\hat{x}_i - \hat{x}_j\|_2$.

Miscellaneous  problems 

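8.28 Let $\mathcal{P}_1$ and $\mathcal{P}_2$ be two polyhedra described as

$$\mathcal{P}_1 = \{x \mid Ax \preceq b\}, \qquad \mathcal{P}_2 = \{x \mid -\mathbf{1} \preceq Cx \preceq \mathbf{1}\},$$

where $A \in \mathbf{R}^{m \times n}$, $C \in \mathbf{R}^{p \times n}$, and $b \in \mathbf{R}^m$. The polyhedron $\mathcal{P}_2$ is symmetric about the
origin. For $t \ge 0$ and $x_c \in \mathbf{R}^n$, we use the notation $t\mathcal{P}_2 + x_c$ to denote the polyhedron

$$t\mathcal{P}_2 + x_c = \{tx + x_c \mid x \in \mathcal{P}_2\},$$

which is obtained by first scaling $\mathcal{P}_2$ by a factor $t$ about the origin, and then translating
its center to $x_c$.

Show how to solve the following two problems, via an LP, or a set of LPs.

(a) Find the largest polyhedron $t\mathcal{P}_2 + x_c$ enclosed in $\mathcal{P}_1$, i.e.,

$$\begin{array}{ll} \text{maximize} & t \\ \text{subject to} & t\mathcal{P}_2 + x_c \subseteq \mathcal{P}_1 \\ & t \ge 0. \end{array}$$

(b) Find the smallest polyhedron $t\mathcal{P}_2 + x_c$ containing $\mathcal{P}_1$, i.e.,

$$\begin{array}{ll} \text{minimize} & t \\ \text{subject to} & \mathcal{P}_1 \subseteq t\mathcal{P}_2 + x_c \\ & t \ge 0. \end{array}$$

In both problems the variables are $t \in \mathbf{R}$ and $x_c \in \mathbf{R}^n$.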

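8.29 Outer polyhedral approximations. Let $\mathcal{P} = \{x \in \mathbf{R}^n \mid Ax \preceq b\}$ be a polyhedron, and
$C \subseteq \mathbf{R}^n$ a given set (not necessarily convex). Use the support function $S_C$ to formulate
the following problem as an LP:

$$\begin{array}{ll} \text{minimize} & t \\ \text{subject to} & C \subseteq t\mathcal{P} + x \\ & t \ge 0. \end{array}$$

Here $t\mathcal{P} + x = \{tu + x \mid u \in \mathcal{P}\}$, the polyhedron $\mathcal{P}$ scaled by a factor of $t$ about the origin,
and translated by $x$. The variables are $t \in \mathbf{R}$ and $x \in \mathbf{R}^n$.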

8.30 Interpolation with piecewise-arc curve. A sequence of points $a_1, \ldots, a_n \in \mathbf{R}^2$ is given. We
construct a curve that passes through these points, in order, and is an arc (i.e., part of a
circle) or line segment (which we think of as an arc of infinite radius) between consecutive
points. Many arcs connect $a_i$ and $a_{i+1}$; we parameterize these arcs by giving the angle
$\theta_i \in (-\pi, \pi)$ between its tangent at $a_i$ and the line segment $[a_i, a_{i+1}]$. Thus, $\theta_i = 0$ means
the arc between $a_i$ and $a_{i+1}$ is in fact the line segment $[a_i, a_{i+1}]$; $\theta_i = \pi/2$ means the arc
between $a_i$ and $a_{i+1}$ is a half-circle (above the linear segment $[a_1, a_2]$); $\theta_i = -\pi/2$ means
the arc between $a_i$ and $a_{i+1}$ is a half-circle (below the linear segment $[a_1, a_2]$). This is
illustrated below.


Our curve is completely specified by the angles $\theta_1, \ldots, \theta_n$, which can be chosen in the
interval $(-\pi, \pi)$. The choice of $\theta_i$ affects several properties of the curve, for example, its
total arc length $L$, or the joint angle discontinuities, which can be described as follows.
At each point $a_i$, $i = 2, \ldots, n - 1$, two arcs meet, one coming from the previous point and
one going to the next point. If the tangents to these arcs exactly oppose each other, so the
curve is differentiable at $a_i$, we say there is no joint angle discontinuity at $a_i$. In general,
we define the joint angle discontinuity at $a_i$ as $|\theta_{i-1} + \theta_i + \psi_i|$, where $\psi_i$ is the angle between
the line segment $[a_i, a_{i+1}]$ and the line segment $[a_{i-1}, a_i]$, i.e., $\psi_i = \angle(a_i - a_{i+1}, a_{i-1} - a_i)$.
This is shown below. Note that the angles $\psi_i$ are known (since the $a_i$ are known).


We define the total joint angle discontinuity as

$$D = \sum_{i=2}^{n} |\theta_{i-1} + \theta_i + \psi_i|.$$

Formulate the problem of minimizing total arc length $L$, and total joint angle
discontinuity $D$, as a bi-criterion convex optimization problem. Explain how you would
find the extreme points on the optimal trade-off curve.


Part  III 
Algorithms 


Chapter  9 


Unconstrained  minimization 


9.1  Unconstrained  minimization  problems 

In this chapter we discuss methods for solving the unconstrained optimization
problem

$$\text{minimize} \quad f(x) \qquad (9.1)$$

where $f : \mathbf{R}^n \to \mathbf{R}$ is convex and twice continuously differentiable (which implies
that $\mathbf{dom}\, f$ is open). We will assume that the problem is solvable, i.e., there exists
an optimal point $x^\star$. (More precisely, the assumptions later in the chapter will
imply that $x^\star$ exists and is unique.) We denote the optimal value,
$\inf_x f(x) = f(x^\star)$, as $p^\star$.

Since $f$ is differentiable and convex, a necessary and sufficient condition for a
point $x^\star$ to be optimal is

$$\nabla f(x^\star) = 0 \qquad (9.2)$$

(see §4.2.3). Thus, solving the unconstrained minimization problem (9.1) is the
same as finding a solution of (9.2), which is a set of $n$ equations in the $n$ variables
$x_1, \ldots, x_n$. In a few special cases, we can find a solution to the problem (9.1) by
analytically solving the optimality equation (9.2), but usually the problem must
be solved by an iterative algorithm. By this we mean an algorithm that computes
a sequence of points $x^{(0)}, x^{(1)}, \ldots \in \mathbf{dom}\, f$ with $f(x^{(k)}) \to p^\star$ as $k \to \infty$. Such
a sequence of points is called a minimizing sequence for the problem (9.1). The
algorithm is terminated when $f(x^{(k)}) - p^\star \le \epsilon$, where $\epsilon > 0$ is some specified
tolerance.

Initial  point  and  sublevel  set 

The methods described in this chapter require a suitable starting point $x^{(0)}$. The
starting point must lie in $\mathbf{dom}\, f$, and in addition the sublevel set

$$S = \{x \in \mathbf{dom}\, f \mid f(x) \le f(x^{(0)})\} \qquad (9.3)$$

must be closed. This condition is satisfied for all $x^{(0)} \in \mathbf{dom}\, f$ if the function $f$ is
closed, i.e., all its sublevel sets are closed (see §A.3.3). Continuous functions with
$\mathbf{dom}\, f = \mathbf{R}^n$ are closed, so if $\mathbf{dom}\, f = \mathbf{R}^n$, the initial sublevel set condition is
satisfied by any $x^{(0)}$. Another important class of closed functions are continuous
functions with open domains, for which $f(x)$ tends to infinity as $x$ approaches
$\mathbf{bd}\, \mathbf{dom}\, f$.


9.1.1  Examples 

Quadratic  minimization  and  least-squares 

The general convex quadratic minimization problem has the form

$$\text{minimize} \quad (1/2) x^T P x + q^T x + r, \qquad (9.4)$$

where $P \in \mathbf{S}^n_+$, $q \in \mathbf{R}^n$, and $r \in \mathbf{R}$. This problem can be solved via the optimality
conditions, $Px^\star + q = 0$, which is a set of linear equations. When $P \succ 0$, there is
a unique solution, $x^\star = -P^{-1} q$. In the more general case when $P$ is not positive
definite, any solution of $Px^\star = -q$ is optimal for (9.4); if $Px^\star = -q$ does not
have a solution, then the problem (9.4) is unbounded below (see exercise 9.1). Our
ability to analytically solve the quadratic minimization problem (9.4) is the basis
for Newton's method, a powerful method for unconstrained minimization described
in §9.5.
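As a small worked illustration of solving (9.4) through its optimality condition, the following sketch handles a two-variable instance with $P \succ 0$. The particular $P$, $q$, $r$ and the helper names `solve_2x2` and `quad` are assumptions chosen for the example, and Cramer's rule is used only because the system is 2 by 2.

```python
def solve_2x2(P, rhs):
    """Solve the 2x2 linear system P x = rhs by Cramer's rule."""
    (a, b), (c, d) = P
    det = a * d - b * c
    if det == 0:
        raise ValueError("P is singular; the optimality condition may have no unique solution")
    return [(rhs[0] * d - b * rhs[1]) / det,
            (a * rhs[1] - c * rhs[0]) / det]


def quad(P, q, r, x):
    """Evaluate (1/2) x^T P x + q^T x + r for a 2-vector x."""
    Px = [P[0][0] * x[0] + P[0][1] * x[1],
          P[1][0] * x[0] + P[1][1] * x[1]]
    return 0.5 * (x[0] * Px[0] + x[1] * Px[1]) + q[0] * x[0] + q[1] * x[1] + r


# Illustrative data (assumed): P is positive definite, so x* is unique.
P = [[2.0, 0.0], [0.0, 4.0]]
q = [-2.0, -8.0]
r = 1.0

# Optimality condition P x* + q = 0, i.e., P x* = -q.
x_star = solve_2x2(P, [-q[0], -q[1]])
p_star = quad(P, q, r, x_star)
```

For this data the optimality condition gives $x^\star = (1, 2)$ and $p^\star = -8$; any other point has a larger objective value.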

One special case of the quadratic minimization problem that arises very
frequently is the least-squares problem

$$\text{minimize} \quad \|Ax - b\|_2^2 = x^T (A^T A) x - 2 (A^T b)^T x + b^T b.$$

The optimality conditions

$$A^T A x^\star = A^T b$$

are called the normal equations of the least-squares problem.


Unconstrained  geometric  programming 

As a second example, we consider an unconstrained geometric program in convex
form,

$$\text{minimize} \quad f(x) = \log\left(\sum_{i=1}^m \exp(a_i^T x + b_i)\right).$$

The optimality condition is

$$\nabla f(x^\star) = \frac{1}{\sum_{j=1}^m \exp(a_j^T x^\star + b_j)} \sum_{i=1}^m \exp(a_i^T x^\star + b_i)\, a_i = 0,$$

which in general has no analytical solution, so here we must resort to an iterative
algorithm. For this problem, $\mathbf{dom}\, f = \mathbf{R}^n$, so any point can be chosen as the
initial point $x^{(0)}$.
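An iterative method for this problem needs the objective and its gradient. The sketch below evaluates both in plain Python. The function names and the sample data `A_demo`, `b_demo` are assumptions, and subtracting the largest exponent before exponentiating is a common numerical safeguard rather than something the text prescribes.

```python
import math


def affine_values(A, b, x):
    """Return the list of values a_i^T x + b_i, one per row a_i of A."""
    return [sum(aij * xj for aij, xj in zip(ai, x)) + bi for ai, bi in zip(A, b)]


def gp_objective(A, b, x):
    """f(x) = log(sum_i exp(a_i^T x + b_i)), with the max exponent shifted out."""
    z = affine_values(A, b, x)
    zmax = max(z)
    return zmax + math.log(sum(math.exp(zi - zmax) for zi in z))


def gp_gradient(A, b, x):
    """Gradient: sum_i exp(z_i) a_i / sum_j exp(z_j); the shift cancels in the ratio."""
    z = affine_values(A, b, x)
    zmax = max(z)
    w = [math.exp(zi - zmax) for zi in z]
    total = sum(w)
    return [sum(wi * ai[j] for wi, ai in zip(w, A)) / total for j in range(len(x))]


# Illustrative data (assumed): four terms arranged symmetrically about the origin,
# so the gradient vanishes at x = 0 and the optimal value is log 4.
A_demo = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_demo = [0.0, 0.0, 0.0, 0.0]
```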


Analytic  center  of  linear  inequalities 

We consider the optimization problem

$$\text{minimize} \quad f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x), \qquad (9.5)$$

where the domain of $f$ is the open set

$$\mathbf{dom}\, f = \{x \mid a_i^T x < b_i, \; i = 1, \ldots, m\}.$$

The objective function $f$ in this problem is called the logarithmic barrier for the
inequalities $a_i^T x \le b_i$. The solution of (9.5), if it exists, is called the analytic
center of the inequalities. The initial point $x^{(0)}$ must satisfy the strict inequalities
$a_i^T x^{(0)} < b_i$, $i = 1, \ldots, m$. Since $f$ is closed, the sublevel set $S$ for any such point
is closed.

Analytic  center  of  a  linear  matrix  inequality 

A closely related problem is

$$\text{minimize} \quad f(x) = \log\det F(x)^{-1} \qquad (9.6)$$

where $F : \mathbf{R}^n \to \mathbf{S}^p$ is affine, i.e.,

$$F(x) = F_0 + x_1 F_1 + \cdots + x_n F_n,$$

with $F_i \in \mathbf{S}^p$. Here the domain of $f$ is

$$\mathbf{dom}\, f = \{x \mid F(x) \succ 0\}.$$

The objective function $f$ is called the logarithmic barrier for the linear matrix
inequality $F(x) \succeq 0$, and the solution (if it exists) is called the analytic center of
the linear matrix inequality. The initial point $x^{(0)}$ must satisfy the strict linear
matrix inequality $F(x^{(0)}) \succ 0$. As in the previous example, the sublevel set of any
such point will be closed, since $f$ is closed.


9.1.2  Strong  convexity  and  implications 

In much of this chapter (with the exception of §9.6) we assume that the objective
function is strongly convex on $S$, which means that there exists an $m > 0$ such that

$$\nabla^2 f(x) \succeq mI \qquad (9.7)$$

for all $x \in S$. Strong convexity has several interesting consequences. For $x, y \in S$
we have

$$f(y) = f(x) + \nabla f(x)^T (y - x) + \frac{1}{2} (y - x)^T \nabla^2 f(z) (y - x)$$

for some $z$ on the line segment $[x, y]$. By the strong convexity assumption (9.7), the
last term on the righthand side is at least $(m/2)\|y - x\|_2^2$, so we have the inequality

$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2 \qquad (9.8)$$

for all $x$ and $y$ in $S$. When $m = 0$, we recover the basic inequality characterizing
convexity; for $m > 0$ we obtain a better lower bound on $f(y)$ than follows from
convexity alone.
We will first show that the inequality (9.8) can be used to bound $f(x) - p^\star$,
which is the suboptimality of the point $x$, in terms of $\|\nabla f(x)\|_2$. The righthand
side of (9.8) is a convex quadratic function of $y$ (for fixed $x$). Setting the gradient
with respect to $y$ equal to zero, we find that $\tilde{y} = x - (1/m) \nabla f(x)$ minimizes the
righthand side. Therefore we have

$$\begin{aligned} f(y) &\ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2 \\ &\ge f(x) + \nabla f(x)^T (\tilde{y} - x) + \frac{m}{2} \|\tilde{y} - x\|_2^2 \\ &= f(x) - \frac{1}{2m} \|\nabla f(x)\|_2^2. \end{aligned}$$

Since this holds for any $y \in S$, we have

$$p^\star \ge f(x) - \frac{1}{2m} \|\nabla f(x)\|_2^2. \qquad (9.9)$$

This inequality shows that if the gradient is small at a point, then the point is
nearly optimal. The inequality (9.9) can also be interpreted as a condition for
suboptimality which generalizes the optimality condition (9.2):

$$\|\nabla f(x)\|_2 \le (2m\epsilon)^{1/2} \implies f(x) - p^\star \le \epsilon. \qquad (9.10)$$
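The bound (9.9) can be checked numerically on a function whose strong convexity constant is known. The sketch below uses the assumed test function $f(x) = (1/2)(x_1^2 + 10 x_2^2)$, for which $m = 1$ and $p^\star = 0$; the function and variable names are illustrative.

```python
def f(x):
    """Assumed test function f(x) = (1/2)(x1^2 + 10 x2^2); its Hessian is diag(1, 10)."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def grad_f(x):
    """Gradient of the test function."""
    return [x[0], 10.0 * x[1]]


m = 1.0        # smallest Hessian eigenvalue, so the Hessian dominates m*I everywhere
p_star = 0.0   # optimal value, attained at x = 0


def suboptimality_bound(x):
    """Right-hand side of f(x) - p* <= ||grad f(x)||_2^2 / (2m), a rearrangement of (9.9)."""
    g = grad_f(x)
    return (g[0] ** 2 + g[1] ** 2) / (2.0 * m)


# Compare the true suboptimality with the gradient-based bound on a small grid.
grid = [(-2.0 + 0.5 * i, -2.0 + 0.5 * j) for i in range(9) for j in range(9)]
gaps = [(f(x) - p_star, suboptimality_bound(x)) for x in grid]
```

For this function the true gap is $(x_1^2 + 10 x_2^2)/2$ while the bound is $(x_1^2 + 100 x_2^2)/2$, so the bound is tight along the $x_1$ axis and loose by a factor of 10 along the $x_2$ axis.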


We can also derive a bound on $\|x - x^\star\|_2$, the distance between $x$ and any
optimal point $x^\star$, in terms of $\|\nabla f(x)\|_2$:

$$\|x - x^\star\|_2 \le \frac{2}{m} \|\nabla f(x)\|_2. \qquad (9.11)$$

To see this, we apply (9.8) with $y = x^\star$ to obtain

$$\begin{aligned} p^\star = f(x^\star) &\ge f(x) + \nabla f(x)^T (x^\star - x) + \frac{m}{2} \|x^\star - x\|_2^2 \\ &\ge f(x) - \|\nabla f(x)\|_2 \|x^\star - x\|_2 + \frac{m}{2} \|x^\star - x\|_2^2, \end{aligned}$$

where we use the Cauchy-Schwarz inequality in the second inequality. Since
$p^\star \le f(x)$, we must have

$$-\|\nabla f(x)\|_2 \|x^\star - x\|_2 + \frac{m}{2} \|x^\star - x\|_2^2 \le 0,$$

from which (9.11) follows. One consequence of (9.11) is that the optimal point $x^\star$
is unique.

Upper bound on $\nabla^2 f(x)$

The inequality (9.8) implies that the sublevel sets contained in $S$ are bounded, so in
particular, $S$ is bounded. Therefore the maximum eigenvalue of $\nabla^2 f(x)$, which is a
continuous function of $x$ on $S$, is bounded above on $S$, i.e., there exists a constant
$M$ such that

$$\nabla^2 f(x) \preceq MI \qquad (9.12)$$

for all $x \in S$. This upper bound on the Hessian implies for any $x, y \in S$,

$$f(y) \le f(x) + \nabla f(x)^T (y - x) + \frac{M}{2} \|y - x\|_2^2, \qquad (9.13)$$

which is analogous to (9.8). Minimizing each side over $y$ yields

$$p^\star \le f(x) - \frac{1}{2M} \|\nabla f(x)\|_2^2, \qquad (9.14)$$

the counterpart of (9.9).

Condition  number  of  sublevel  sets 

From the strong convexity inequality (9.7) and the inequality (9.12), we have

$$mI \preceq \nabla^2 f(x) \preceq MI \qquad (9.15)$$

for all $x \in S$. The ratio $\kappa = M/m$ is thus an upper bound on the condition
number of the matrix $\nabla^2 f(x)$, i.e., the ratio of its largest eigenvalue to its smallest
eigenvalue. We can also give a geometric interpretation of (9.15) in terms of the
sublevel sets of $f$.

We define the width of a convex set $C \subseteq \mathbf{R}^n$, in the direction $q$, where $\|q\|_2 = 1$,
as

$$W(C, q) = \sup_{z \in C} q^T z - \inf_{z \in C} q^T z.$$

The minimum width and maximum width of $C$ are given by

$$W_{\min} = \inf_{\|q\|_2 = 1} W(C, q), \qquad W_{\max} = \sup_{\|q\|_2 = 1} W(C, q).$$

The condition number of the convex set $C$ is defined as

$$\mathbf{cond}(C) = \frac{W_{\max}^2}{W_{\min}^2},$$

i.e., the square of the ratio of its maximum width to its minimum width. The
condition number of $C$ gives a measure of its anisotropy or eccentricity. If the
condition number of a set $C$ is small (say, near one) it means that the set has
approximately the same width in all directions, i.e., it is nearly spherical. If the
condition number is large, it means that the set is far wider in some directions than
in others.


Example 9.1 Condition number of an ellipsoid. Let $\mathcal{E}$ be the ellipsoid

$$\mathcal{E} = \{x \mid (x - x_0)^T A^{-1} (x - x_0) \le 1\},$$

where $A \in \mathbf{S}^n_{++}$. The width of $\mathcal{E}$ in the direction $q$ is

$$\begin{aligned} \sup_{z \in \mathcal{E}} q^T z - \inf_{z \in \mathcal{E}} q^T z &= (\|A^{1/2} q\|_2 + q^T x_0) - (-\|A^{1/2} q\|_2 + q^T x_0) \\ &= 2 \|A^{1/2} q\|_2. \end{aligned}$$

It follows that its minimum and maximum width are

$$W_{\min} = 2 \lambda_{\min}(A)^{1/2}, \qquad W_{\max} = 2 \lambda_{\max}(A)^{1/2},$$

and its condition number is

$$\mathbf{cond}(\mathcal{E}) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} = \kappa(A),$$

where $\kappa(A)$ denotes the condition number of the matrix $A$, i.e., the ratio of its
maximum singular value to its minimum singular value. Thus the condition number
of the ellipsoid $\mathcal{E}$ is the same as the condition number of the matrix $A$ that defines
it.
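The example can be checked numerically in two dimensions by sampling unit directions. The sketch below assumes a diagonal matrix $A = \mathbf{diag}(4, 1)$, so that $\|A^{1/2} q\|_2^2 = 4 q_1^2 + q_2^2$; the helper name `ellipsoid_width` and the one-degree sampling grid are illustrative choices.

```python
import math


def ellipsoid_width(diag_A, q):
    """Width 2 * ||A^{1/2} q||_2 of {x | (x - x0)^T A^{-1} (x - x0) <= 1} for diagonal A."""
    return 2.0 * math.sqrt(sum(a * qi * qi for a, qi in zip(diag_A, q)))


diag_A = [4.0, 1.0]   # assumed example: eigenvalues 4 and 1, so kappa(A) = 4

# Sample unit directions q = (cos t, sin t) at one-degree spacing over a half turn.
angles = [math.pi * k / 180.0 for k in range(180)]
widths = [ellipsoid_width(diag_A, [math.cos(t), math.sin(t)]) for t in angles]

w_min, w_max = min(widths), max(widths)
cond = (w_max / w_min) ** 2   # squared ratio of maximum to minimum width
```

The sampled widths range from 2 (along the second axis) to 4 (along the first axis), so the computed condition number agrees with $\kappa(A) = 4$.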


Now suppose $f$ satisfies $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$. We will derive
a bound on the condition number of the $\alpha$-sublevel set $C_\alpha = \{x \mid f(x) \le \alpha\}$, where
$p^\star < \alpha \le f(x^{(0)})$. Applying (9.13) and (9.8) with $x = x^\star$, we have

$$p^\star + (M/2) \|y - x^\star\|_2^2 \ge f(y) \ge p^\star + (m/2) \|y - x^\star\|_2^2.$$

This implies that $B_{\mathrm{inner}} \subseteq C_\alpha \subseteq B_{\mathrm{outer}}$ where

$$B_{\mathrm{inner}} = \{y \mid \|y - x^\star\|_2 \le (2(\alpha - p^\star)/M)^{1/2}\},$$

$$B_{\mathrm{outer}} = \{y \mid \|y - x^\star\|_2 \le (2(\alpha - p^\star)/m)^{1/2}\}.$$

In other words, the $\alpha$-sublevel set contains $B_{\mathrm{inner}}$, and is contained in $B_{\mathrm{outer}}$, which
are balls with radii

$$(2(\alpha - p^\star)/M)^{1/2}, \qquad (2(\alpha - p^\star)/m)^{1/2},$$

respectively. The ratio of the radii squared gives an upper bound on the condition
number of $C_\alpha$:

$$\mathbf{cond}(C_\alpha) \le \frac{M}{m}.$$

We can also give a geometric interpretation of the condition number $\kappa(\nabla^2 f(x^\star))$
of the Hessian at the optimum. From the Taylor series expansion of $f$ around $x^\star$,

$$f(y) \approx p^\star + \frac{1}{2} (y - x^\star)^T \nabla^2 f(x^\star) (y - x^\star),$$

we see that, for $\alpha$ close to $p^\star$,

$$C_\alpha \approx \{y \mid (y - x^\star)^T \nabla^2 f(x^\star) (y - x^\star) \le 2(\alpha - p^\star)\},$$

i.e., the sublevel set is well approximated by an ellipsoid with center $x^\star$. Therefore

$$\lim_{\alpha \to p^\star} \mathbf{cond}(C_\alpha) = \kappa(\nabla^2 f(x^\star)).$$

We will see that the condition number of the sublevel sets of $f$ (which is bounded
by $M/m$) has a strong effect on the efficiency of some common methods for
unconstrained minimization.


The strong convexity constants

It must be kept in mind that the constants $m$ and $M$ are known only in rare cases,
so the inequality (9.10) cannot be used as a practical stopping criterion. It can be
considered a conceptual stopping criterion; it shows that if the gradient of $f$ at $x$
is small enough, then the difference between $f(x)$ and $p^\star$ is small. If we terminate
an algorithm when $\|\nabla f(x^{(k)})\|_2 \le \eta$, where $\eta$ is chosen small enough to be (very
likely) smaller than $(m\epsilon)^{1/2}$, then we have $f(x^{(k)}) - p^\star \le \epsilon$ (very likely).

In the following sections we give convergence proofs for algorithms, which
include bounds on the number of iterations required before $f(x^{(k)}) - p^\star \le \epsilon$, where
$\epsilon$ is some positive tolerance. Many of these bounds involve the (usually unknown)
constants $m$ and $M$, so the same comments apply. These results are at least
conceptually useful; they establish that the algorithm converges, even if the bound on
the number of iterations required to reach a given accuracy depends on constants
that are unknown.

We will encounter one important exception to this situation. In §9.6 we will
study a special class of convex functions, called self-concordant, for which we can
provide a complete convergence analysis (for Newton's method) that does not
depend on any unknown constants.


9.2  Descent  methods 

The algorithms described in this chapter produce a minimizing sequence $x^{(k)}$,
$k = 1, \ldots$, where

$$x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)}$$

and $t^{(k)} > 0$ (except when $x^{(k)}$ is optimal). Here the concatenated symbols $\Delta$ and
$x$ that form $\Delta x$ are to be read as a single entity, a vector in $\mathbf{R}^n$ called the step or
search direction (even though it need not have unit norm), and $k = 0, 1, \ldots$ denotes
the iteration number. The scalar $t^{(k)} \ge 0$ is called the step size or step length at
iteration $k$ (even though it is not equal to $\|x^{(k+1)} - x^{(k)}\|$ unless $\|\Delta x^{(k)}\| = 1$).
The terms 'search step' and 'scale factor' are more accurate, but 'search direction'
and 'step length' are the ones widely used. When we focus on one iteration of
an algorithm, we sometimes drop the superscripts and use the lighter notation
$x^+ = x + t\Delta x$, or $x := x + t\Delta x$, in place of $x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)}$.

All the methods we study are descent methods, which means that

$$f(x^{(k+1)}) < f(x^{(k)}),$$

except when $x^{(k)}$ is optimal. This implies that for all $k$ we have $x^{(k)} \in S$, the initial
sublevel set, and in particular we have $x^{(k)} \in \mathbf{dom}\, f$. From convexity we know
that $\nabla f(x^{(k)})^T (y - x^{(k)}) \ge 0$ implies $f(y) \ge f(x^{(k)})$, so the search direction in a
descent method must satisfy

$$\nabla f(x^{(k)})^T \Delta x^{(k)} < 0,$$

i.e., it must make an acute angle with the negative gradient. We call such a
direction a descent direction (for $f$, at $x^{(k)}$).
The outline of a general descent method is as follows. It alternates between two
steps: determining a descent direction $\Delta x$, and the selection of a step size $t$.


Algorithm 9.1 General descent method.

given a starting point $x \in \mathbf{dom}\, f$.

repeat

1. Determine a descent direction $\Delta x$.

2. Line search. Choose a step size $t > 0$.

3. Update. $x := x + t\Delta x$.

until stopping criterion is satisfied.


The second step is called the line search since selection of the step size $t$
determines where along the line $\{x + t\Delta x \mid t \in \mathbf{R}_+\}$ the next iterate will be. (A more
accurate term might be ray search.)

A practical descent method has the same general structure, but might be
organized differently. For example, the stopping criterion is often checked while, or
immediately after, the descent direction $\Delta x$ is computed. The stopping criterion
is often of the form $\|\nabla f(x)\|_2 \le \eta$, where $\eta$ is small and positive, as suggested by
the suboptimality condition (9.9).

Exact  line  search 

One line search method sometimes used in practice is exact line search, in which $t$
is chosen to minimize $f$ along the ray $\{x + t\Delta x \mid t \ge 0\}$:

$$t = \mathop{\mathrm{argmin}}_{s \ge 0} f(x + s\Delta x). \qquad (9.16)$$

An exact line search is used when the cost of the minimization problem with one
variable, required in (9.16), is low compared to the cost of computing the search
direction itself. In some special cases the minimizer along the ray can be found
analytically, and in others it can be computed efficiently. (This is discussed in §9.7.1.)

Backtracking  line  search 

Most line searches used in practice are inexact: the step length is chosen to
approximately minimize $f$ along the ray $\{x + t\Delta x \mid t \ge 0\}$, or even to just reduce
$f$ 'enough'. Many inexact line search methods have been proposed. One inexact
line search method that is very simple and quite effective is called backtracking line
search. It depends on two constants $\alpha$, $\beta$ with $0 < \alpha < 0.5$, $0 < \beta < 1$.


Algorithm 9.2 Backtracking line search.

given a descent direction $\Delta x$ for $f$ at $x \in \mathbf{dom}\, f$, $\alpha \in (0, 0.5)$, $\beta \in (0, 1)$.

$t := 1$.

while $f(x + t\Delta x) > f(x) + \alpha t \nabla f(x)^T \Delta x$, $t := \beta t$.
Figure 9.1 Backtracking line search. The curve shows $f$, restricted to the line
over which we search. The lower dashed line shows the linear extrapolation
of $f$, and the upper dashed line has a slope a factor of $\alpha$ smaller. The
backtracking condition is that $f$ lies below the upper dashed line, i.e.,
$0 \le t \le t_0$.


The line search is called backtracking because it starts with unit step size and
then reduces it by the factor $\beta$ until the stopping condition
$f(x + t\Delta x) \le f(x) + \alpha t \nabla f(x)^T \Delta x$ holds. Since $\Delta x$ is a descent direction, we have
$\nabla f(x)^T \Delta x < 0$, so for small enough $t$ we have

$$f(x + t\Delta x) \approx f(x) + t \nabla f(x)^T \Delta x < f(x) + \alpha t \nabla f(x)^T \Delta x,$$

which shows that the backtracking line search eventually terminates. The constant
$\alpha$ can be interpreted as the fraction of the decrease in $f$ predicted by linear
extrapolation that we will accept. (The reason for requiring $\alpha$ to be smaller than 0.5 will
become clear later.)

The backtracking condition is illustrated in figure 9.1. This figure suggests,
and it can be shown, that the backtracking exit inequality
$f(x + t\Delta x) \le f(x) + \alpha t \nabla f(x)^T \Delta x$ holds for $t \ge 0$ in an interval $(0, t_0]$. It follows that the
backtracking line search stops with a step length $t$ that satisfies

$$t = 1, \quad \text{or} \quad t \in (\beta t_0, t_0].$$

The first case occurs when the step length $t = 1$ satisfies the backtracking condition,
i.e., $1 \le t_0$. In particular, we can say that the step length obtained by backtracking
line search satisfies

$$t \ge \min\{1, \beta t_0\}.$$

When  dom  /  is  not  all  of  Rn,  the  condition  /(x  +  fAx)  <  /(x)  +  atVf(x)T Ax 
in  the  backtracking  line  search  must  be  interpreted  carefully.  By  our  convention 
that  /  is  infinite  outside  its  domain,  the  inequality  implies  that  x  +  <Ax  £  dom  f. 
In  a  practical  implementation,  we  first  multiply  t  by  until  x  +  tAx  £  dom  /; 




9  Unconstrained  minimization 


then we start to check whether the inequality $f(x + t\Delta x) \le f(x) + \alpha t \nabla f(x)^T \Delta x$
holds.

The parameter $\alpha$ is typically chosen between 0.01 and 0.3, meaning that we
accept a decrease in $f$ between 1% and 30% of the prediction based on the linear
extrapolation. The parameter $\beta$ is often chosen to be between 0.1 (which
corresponds to a very crude search) and 0.8 (which corresponds to a less crude search).
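The backtracking rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `backtracking_line_search`, the list-based vectors, and the test function (a log barrier on $|x_1| < 1$ plus $x_2^2$) are ours, not part of the text. The objective returns infinity outside its domain, so the single exit inequality also rejects trial points outside $\operatorname{dom} f$.

```python
import math

def backtracking_line_search(f, grad_fx, x, dx, alpha=0.1, beta=0.7):
    """Return t with f(x + t*dx) <= f(x) + alpha*t*grad_fx^T dx.

    f must return math.inf outside its domain, so points outside dom f
    are rejected by the same comparison.
    """
    fx = f(x)
    slope = sum(g * d for g, d in zip(grad_fx, dx))  # directional derivative
    if slope >= 0:
        raise ValueError("dx is not a descent direction")
    t = 1.0
    while f([xi + t * di for xi, di in zip(x, dx)]) > fx + alpha * t * slope:
        t *= beta
    return t

# Hypothetical test function with dom f = {x : |x1| < 1}.
def f(x):
    if abs(x[0]) >= 1.0:
        return math.inf
    return -math.log(1 - x[0]) - math.log(1 + x[0]) + x[1] ** 2

def grad(x):
    return [1 / (1 - x[0]) - 1 / (1 + x[0]), 2 * x[1]]

x = [0.9, 1.0]
g = grad(x)
dx = [-gi for gi in g]            # negative gradient as descent direction
t = backtracking_line_search(f, g, x, dx)
x_new = [xi + t * di for xi, di in zip(x, dx)]
print(t, x_new, f(x_new))
```

Starting near the boundary of the domain, the first few trial steps land outside $\operatorname{dom} f$ and are discarded before the sufficient-decrease test becomes meaningful.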


9.3  Gradient  descent  method 

A natural choice for the search direction is the negative gradient $\Delta x = -\nabla f(x)$.
The resulting algorithm is called the gradient algorithm or gradient descent method.


Algorithm 9.3 Gradient descent method.

given a starting point $x \in \operatorname{dom} f$.
repeat

1. $\Delta x := -\nabla f(x)$.

2. Line search. Choose step size $t$ via exact or backtracking line search.

3. Update. $x := x + t\Delta x$.
until stopping criterion is satisfied.


The stopping criterion is usually of the form $\|\nabla f(x)\|_2 \le \eta$, where $\eta$ is small and
positive. In most implementations, this condition is checked after step 1, rather
than after the update.
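A minimal Python sketch of the gradient method with backtracking line search follows. The function name, the default parameter values, the iteration cap, and the small-step guard are our own assumptions; the test problem is the quadratic $f(x) = (1/2)(x_1^2 + 10x_2^2)$. As described above, the stopping criterion is checked right after the gradient is computed.

```python
import math

def gradient_descent(f, grad, x0, alpha=0.1, beta=0.7, eta=1e-6, max_iter=10_000):
    """Gradient descent with backtracking line search (illustrative sketch)."""
    x = list(x0)
    for k in range(max_iter):
        g = grad(x)
        # Stopping criterion ||grad f(x)||_2 <= eta, checked after step 1.
        if math.sqrt(sum(gi * gi for gi in g)) <= eta:
            return x, k
        dx = [-gi for gi in g]
        slope = -sum(gi * gi for gi in g)      # grad f(x)^T dx
        fx, t = f(x), 1.0
        # Backtracking; the t > 1e-12 guard is a numerical safety net.
        while (f([xi + t * di for xi, di in zip(x, dx)]) > fx + alpha * t * slope
               and t > 1e-12):
            t *= beta
        x = [xi + t * di for xi, di in zip(x, dx)]
    return x, max_iter

def f(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad(x):
    return [x[0], 10.0 * x[1]]

x_min, iters = gradient_descent(f, grad, [10.0, 1.0])
print(x_min, iters)
```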


9.3.1  Convergence  analysis 

In this section we present a simple convergence analysis for the gradient method,
using the lighter notation $x^+ = x + t\Delta x$ for $x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)}$, where $\Delta x = -\nabla f(x)$.
We assume $f$ is strongly convex on $S$, so there are positive constants $m$
and $M$ such that $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in S$. Define the function $\tilde f : \mathbf{R} \to \mathbf{R}$
by $\tilde f(t) = f(x - t\nabla f(x))$, i.e., $f$ as a function of the step length $t$ in the negative
gradient direction. In the following discussion we will only consider $t$ for which
$x - t\nabla f(x) \in S$. From the inequality (9.13), with $y = x - t\nabla f(x)$, we obtain a
quadratic upper bound on $\tilde f$:

\[ \tilde f(t) \le f(x) - t\|\nabla f(x)\|_2^2 + \frac{Mt^2}{2}\|\nabla f(x)\|_2^2. \tag{9.17} \]

Analysis  for  exact  line  search 

We now assume that an exact line search is used, and minimize over $t$ both sides
of the inequality (9.17). On the lefthand side we get $\tilde f(t_{\mathrm{exact}})$, where $t_{\mathrm{exact}}$ is the
step length that minimizes $\tilde f$. The righthand side is a simple quadratic, which
is minimized by $t = 1/M$, and has minimum value $f(x) - (1/(2M))\|\nabla f(x)\|_2^2$.
Therefore we have

\[ f(x^+) = \tilde f(t_{\mathrm{exact}}) \le f(x) - \frac{1}{2M}\|\nabla f(x)\|_2^2. \]

Subtracting $p^\star$ from both sides, we get

\[ f(x^+) - p^\star \le f(x) - p^\star - \frac{1}{2M}\|\nabla f(x)\|_2^2. \]

We combine this with $\|\nabla f(x)\|_2^2 \ge 2m(f(x) - p^\star)$ (which follows from (9.9)) to
conclude

\[ f(x^+) - p^\star \le (1 - m/M)(f(x) - p^\star). \]

Applying this inequality recursively, we find that

\[ f(x^{(k)}) - p^\star \le c^k (f(x^{(0)}) - p^\star) \tag{9.18} \]

where $c = 1 - m/M < 1$, which shows that $f(x^{(k)})$ converges to $p^\star$ as $k \to \infty$. In
particular, we must have $f(x^{(k)}) - p^\star \le \epsilon$ after at most

\[ \frac{\log((f(x^{(0)}) - p^\star)/\epsilon)}{\log(1/c)} \tag{9.19} \]

iterations of the gradient method with exact line search.

This bound on the number of iterations required, even though crude, can give
some insight into the gradient method. The numerator,

\[ \log((f(x^{(0)}) - p^\star)/\epsilon), \]

can be interpreted as the log of the ratio of the initial suboptimality (i.e., gap
between $f(x^{(0)})$ and $p^\star$), to the final suboptimality (i.e., less than $\epsilon$). This term
suggests that the number of iterations depends on how good the initial point is,
and what the final required accuracy is.

The denominator appearing in the bound (9.19), $\log(1/c)$, is a function of $M/m$,
which we have seen is a bound on the condition number of $\nabla^2 f(x)$ over $S$, or the
condition number of the sublevel sets $\{z \mid f(z) \le \alpha\}$. For large condition number
bound $M/m$, we have

\[ \log(1/c) = -\log(1 - m/M) \approx m/M, \]

so our bound on the number of iterations required increases approximately linearly
with increasing $M/m$.
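To make the linear growth concrete, the bound (9.19) can be evaluated numerically. The helper below is a hypothetical illustration (the name `iteration_bound` and the sample values of the gap, tolerance, and $M/m$ are our own choices), and it assumes $0 < m < M$ so that $0 < c < 1$.

```python
import math

def iteration_bound(initial_gap, eps, m, M):
    """Evaluate the bound (9.19) for exact line search; assumes 0 < m < M."""
    c = 1.0 - m / M
    return math.log(initial_gap / eps) / math.log(1.0 / c)

# Bound for an initial gap of 1 and target accuracy 1e-6, for several M/m.
bounds = {kappa: iteration_bound(1.0, 1e-6, 1.0, kappa)
          for kappa in (10.0, 100.0, 1000.0)}
for kappa, b in bounds.items():
    print(f"M/m = {kappa:6.0f}: at most {b:9.1f} iterations")
```

Since $-\log(1 - m/M) \ge m/M$, the value returned never exceeds $(M/m)\log((f(x^{(0)}) - p^\star)/\epsilon)$, and it approaches that value as $M/m$ grows.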

We will see that the gradient method does in fact require a large number of
iterations when the Hessian of $f$, near $x^\star$, has a large condition number. Conversely,
when the sublevel sets of $f$ are relatively isotropic, so that the condition number
bound $M/m$ can be chosen to be relatively small, the bound (9.18) shows that
convergence is rapid, since $c$ is small, or at least not too close to one.

The bound (9.18) shows that the error $f(x^{(k)}) - p^\star$ converges to zero at least
as fast as a geometric series. In the context of iterative numerical methods, this
is called linear convergence, since the error lies below a line on a log-linear plot of
error versus iteration number.
Analysis for backtracking line search

Now we consider the case where a backtracking line search is used in the gradient
descent method. We will show that the backtracking exit condition,

\[ \tilde f(t) \le f(x) - \alpha t\|\nabla f(x)\|_2^2, \]

is satisfied whenever $0 \le t \le 1/M$. First note that

\[ 0 \le t \le 1/M \implies -t + \frac{Mt^2}{2} \le -t/2 \]

(which follows from convexity of $-t + Mt^2/2$). Using this result and the bound (9.17),
we have, for $0 \le t \le 1/M$,

\[ \begin{aligned}
\tilde f(t) &\le f(x) - t\|\nabla f(x)\|_2^2 + \frac{Mt^2}{2}\|\nabla f(x)\|_2^2 \\
&\le f(x) - (t/2)\|\nabla f(x)\|_2^2 \\
&\le f(x) - \alpha t\|\nabla f(x)\|_2^2,
\end{aligned} \]

since $\alpha < 1/2$. Therefore the backtracking line search terminates either with $t = 1$
or with a value $t \ge \beta/M$. This provides a lower bound on the decrease in the
objective function. In the first case we have

\[ f(x^+) \le f(x) - \alpha\|\nabla f(x)\|_2^2, \]

and in the second case we have

\[ f(x^+) \le f(x) - (\beta\alpha/M)\|\nabla f(x)\|_2^2. \]

Putting these together, we always have

\[ f(x^+) \le f(x) - \min\{\alpha, \beta\alpha/M\}\|\nabla f(x)\|_2^2. \]

Now we can proceed exactly as in the case of exact line search. We subtract $p^\star$
from both sides to get

\[ f(x^+) - p^\star \le f(x) - p^\star - \min\{\alpha, \beta\alpha/M\}\|\nabla f(x)\|_2^2, \]

and combine this with $\|\nabla f(x)\|_2^2 \ge 2m(f(x) - p^\star)$ to obtain

\[ f(x^+) - p^\star \le (1 - \min\{2m\alpha, 2\beta\alpha m/M\})(f(x) - p^\star). \]

From this we conclude

\[ f(x^{(k)}) - p^\star \le c^k (f(x^{(0)}) - p^\star) \]

where

\[ c = 1 - \min\{2m\alpha, 2\beta\alpha m/M\} < 1. \]

In particular, $f(x^{(k)})$ converges to $p^\star$ at least as fast as a geometric series with an
exponent that depends (at least in part) on the condition number bound $M/m$. In
the terminology of iterative methods, the convergence is at least linear.

Figure 9.2 Some contour lines of the function $f(x) = (1/2)(x_1^2 + 10x_2^2)$. The
condition number of the sublevel sets, which are ellipsoids, is exactly 10.
The figure shows the iterates of the gradient method with exact line search,
started at $x^{(0)} = (10, 1)$.


9.3.2  Examples 


A quadratic problem in $\mathbf{R}^2$

Our first example is very simple. We consider the quadratic objective function on $\mathbf{R}^2$

\[ f(x) = \frac{1}{2}(x_1^2 + \gamma x_2^2), \]

where $\gamma > 0$. Clearly, the optimal point is $x^\star = 0$, and the optimal value is 0. The
Hessian of $f$ is constant, and has eigenvalues 1 and $\gamma$, so the condition numbers of
the sublevel sets of $f$ are all exactly

\[ \frac{\max\{1, \gamma\}}{\min\{1, \gamma\}} = \max\{\gamma, 1/\gamma\}. \]

The tightest choices for the strong convexity constants $m$ and $M$ are

\[ m = \min\{1, \gamma\}, \qquad M = \max\{1, \gamma\}. \]

We apply the gradient descent method with exact line search, starting at the
point $x^{(0)} = (\gamma, 1)$. In this case we can derive the following closed-form expressions
for the iterates $x^{(k)}$ and their function values (exercise 9.6):

\[ x_1^{(k)} = \gamma\left(\frac{\gamma - 1}{\gamma + 1}\right)^k, \qquad x_2^{(k)} = \left(-\frac{\gamma - 1}{\gamma + 1}\right)^k, \]

and

\[ f(x^{(k)}) = \frac{\gamma(\gamma + 1)}{2}\left(\frac{\gamma - 1}{\gamma + 1}\right)^{2k} = \left(\frac{\gamma - 1}{\gamma + 1}\right)^{2k} f(x^{(0)}). \]

This is illustrated in figure 9.2, for $\gamma = 10$.
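The closed-form iterates can be checked numerically. The sketch below (function names are ours) runs the gradient method with exact line search on this quadratic, using the fact that for a quadratic with Hessian $H$ and gradient $g$ the exact step length along $-g$ is $t = g^T g / g^T H g$, and compares the result with the formulas above for $\gamma = 10$.

```python
def exact_line_search_iterates(gamma, num_iters):
    """Gradient descent with exact line search on f(x) = (x1^2 + gamma*x2^2)/2."""
    x1, x2 = gamma, 1.0                      # starting point x^(0) = (gamma, 1)
    iterates = [(x1, x2)]
    for _ in range(num_iters):
        g1, g2 = x1, gamma * x2              # gradient of f at (x1, x2)
        denom = g1 * g1 + gamma * g2 * g2    # g^T H g with H = diag(1, gamma)
        if denom == 0.0:                     # already at the minimizer
            break
        t = (g1 * g1 + g2 * g2) / denom      # exact minimizer of f(x - t*g)
        x1, x2 = x1 - t * g1, x2 - t * g2
        iterates.append((x1, x2))
    return iterates

def closed_form(gamma, k):
    """Closed-form expression for x^(k) stated in the text."""
    r = (gamma - 1.0) / (gamma + 1.0)
    return gamma * r ** k, (-r) ** k

gamma = 10.0
iterates = exact_line_search_iterates(gamma, 20)
max_err = max(abs(a - c) + abs(b - d)
              for (a, b), (c, d) in zip(iterates,
                                        (closed_form(gamma, k) for k in range(21))))
print(max_err)
```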

For this simple example, convergence is exactly linear, i.e., the error is exactly
a geometric series, reduced by the factor $|(\gamma - 1)/(\gamma + 1)|^2$ at each iteration. For
$\gamma = 1$, the exact solution is found in one iteration; for $\gamma$ not far from one (say,
between 1/3 and 3) convergence is rapid. The convergence is very slow for $\gamma \gg 1$
or $\gamma \ll 1$.

We can compare the convergence with the bound derived above in §9.3.1. Using
the least conservative values $m = \min\{1, \gamma\}$ and $M = \max\{1, \gamma\}$, the bound (9.18)
guarantees that the error in each iteration is reduced at least by the factor $c = (1 - m/M)$.
We have seen that the error is in fact reduced exactly by the factor

\[ \left(\frac{1 - m/M}{1 + m/M}\right)^2 \]

in each iteration. For small $m/M$, which corresponds to large condition number,
the upper bound (9.19) implies that the number of iterations required to obtain
a given level of accuracy grows at most like $M/m$. For this example, the exact
number of iterations required grows approximately like $(M/m)/4$, i.e., one quarter
of the value of the bound. This shows that for this simple example, the bound on
the number of iterations derived in our simple analysis is only about a factor of four
conservative (using the least conservative values for $m$ and $M$). In particular, the
convergence rate (as well as its upper bound) is very dependent on the condition
number of the sublevel sets.

A nonquadratic problem in $\mathbf{R}^2$

We now consider a nonquadratic example in $\mathbf{R}^2$, with

\[ f(x_1, x_2) = e^{x_1 + 3x_2 - 0.1} + e^{x_1 - 3x_2 - 0.1} + e^{-x_1 - 0.1}. \tag{9.20} \]

We apply the gradient method with a backtracking line search, with $\alpha = 0.1$,
$\beta = 0.7$. Figure 9.3 shows some level curves of $f$, and the iterates $x^{(k)}$ generated
by the gradient method (shown as small circles). The lines connecting successive
iterates show the scaled steps,

\[ x^{(k+1)} - x^{(k)} = -t^{(k)}\nabla f(x^{(k)}). \]

Figure 9.4 shows the error $f(x^{(k)}) - p^\star$ versus iteration $k$. The plot reveals that
the error converges to zero approximately as a geometric series, i.e., the convergence
is approximately linear. In this example, the error is reduced from about 10 to
about $10^{-7}$ in 20 iterations, so the error is reduced by a factor of approximately
$10^{-8/20} \approx 0.4$ each iteration. This reasonably rapid convergence is predicted by
our convergence analysis, since the sublevel sets of $f$ are not too badly conditioned,
which in turn means that $M/m$ can be chosen as not too large.

To compare backtracking line search with an exact line search, we use the
gradient method with an exact line search, on the same problem, and with the
same starting point. The results are given in figures 9.5 and 9.4. Here too the
convergence is approximately linear, about twice as fast as the gradient method
with backtracking line search. With exact line search, the error is reduced by
about $10^{-11}$ in 15 iterations, i.e., a reduction by a factor of about $10^{-11/15} \approx 0.2$
per iteration.
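The backtracking experiment can be reproduced approximately with the sketch below. The starting point $(-1, 1)$ is an assumption of ours (the text does not state the one used for the figures), so the error sequence need not match figure 9.4 exactly. The optimal value follows from setting the gradient to zero, which gives $x_2 = 0$, $e^{2x_1} = 1/2$, and hence $p^\star = 2\sqrt{2}\,e^{-0.1}$.

```python
import math

def f(x):
    x1, x2 = x
    return (math.exp(x1 + 3 * x2 - 0.1) + math.exp(x1 - 3 * x2 - 0.1)
            + math.exp(-x1 - 0.1))

def grad(x):
    x1, x2 = x
    a = math.exp(x1 + 3 * x2 - 0.1)
    b = math.exp(x1 - 3 * x2 - 0.1)
    c = math.exp(-x1 - 0.1)
    return [a + b - c, 3 * a - 3 * b]

P_STAR = 2 * math.sqrt(2) * math.exp(-0.1)   # optimal value of (9.20)

def run(x0, alpha=0.1, beta=0.7, iters=30):
    """Gradient method with backtracking; returns final point and error history."""
    x = list(x0)
    errors = [f(x) - P_STAR]
    for _ in range(iters):
        g = grad(x)
        gg = g[0] ** 2 + g[1] ** 2
        if gg == 0.0:
            break
        fx, t = f(x), 1.0
        # The t > 1e-12 guard is a numerical safety net, not part of the method.
        while f([x[0] - t * g[0], x[1] - t * g[1]]) > fx - alpha * t * gg and t > 1e-12:
            t *= beta
        x = [x[0] - t * g[0], x[1] - t * g[1]]
        errors.append(f(x) - P_STAR)
    return x, errors

x_final, errors = run([-1.0, 1.0])           # assumed starting point
for k in range(0, len(errors), 5):
    print(k, errors[k])
```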

Figure 9.3 Iterates of the gradient method with backtracking line search,
for the problem in $\mathbf{R}^2$ with objective $f$ given in (9.20). The dashed curves
are level curves of $f$, and the small circles are the iterates of the gradient
method. The solid lines, which connect successive iterates, show the scaled
steps $t^{(k)}\Delta x^{(k)}$.


Figure 9.4 Error $f(x^{(k)}) - p^\star$ versus iteration $k$ of the gradient method with
backtracking and exact line search, for the problem in $\mathbf{R}^2$ with objective $f$
given in (9.20). The plot shows nearly linear convergence, with the error
reduced approximately by the factor 0.4 in each iteration of the gradient
method with backtracking line search, and by the factor 0.2 in each iteration
of the gradient method with exact line search.

Figure 9.5 Iterates of the gradient method with exact line search for the
problem in $\mathbf{R}^2$ with objective $f$ given in (9.20).


A problem in $\mathbf{R}^{100}$

We next consider a larger example, of the form

\[ f(x) = c^T x - \sum_{i=1}^m \log(b_i - a_i^T x), \tag{9.21} \]

with $m = 500$ terms and $n = 100$ variables.

The progress of the gradient method with backtracking line search, with
parameters $\alpha = 0.1$, $\beta = 0.5$, is shown in figure 9.6. In this example we see an initial
approximately linear and fairly rapid convergence for about 20 iterations, followed
by a slower linear convergence. Overall, the error is reduced by a factor of around
$10^6$ in around 175 iterations, which gives an average error reduction by a factor of
around $10^{-6/175} \approx 0.92$ per iteration. The initial convergence rate, for the first 20
iterations, is around a factor of 0.8 per iteration; the slower final convergence rate,
after the first 20 iterations, is around a factor of 0.94 per iteration.

Figure 9.6 shows the convergence of the gradient method with exact line search.
The convergence is again approximately linear, with an overall error reduction by
approximately a factor $10^{-6/140} \approx 0.91$ per iteration. This is only a bit faster than
the gradient method with backtracking line search.

Finally, we examine the influence of the backtracking line search parameters $\alpha$
and $\beta$ on the convergence rate, by determining the number of iterations required
to obtain $f(x^{(k)}) - p^\star \le 10^{-5}$. In the first experiment, we fix $\beta = 0.5$, and vary
$\alpha$ from 0.05 to 0.5. The number of iterations required varies from about 80, for
larger values of $\alpha$, in the range 0.2-0.5, to about 170 for smaller values of $\alpha$. This,
and other experiments, suggest that the gradient method works better with fairly
large $\alpha$, in the range 0.2-0.5.

Similarly, we can study the effect of the choice of $\beta$ by fixing $\alpha = 0.1$ and
varying $\beta$ from 0.05 to 0.95. Again the variation in the total number of iterations
is not large, ranging from around 80 (when $\beta \approx 0.5$) to around 200 (for $\beta$ small,
or near 1). This experiment, and others, suggest that $\beta \approx 0.5$ is a good choice.

Figure 9.6 Error $f(x^{(k)}) - p^\star$ versus iteration $k$ for the gradient method with
backtracking and exact line search, for a problem in $\mathbf{R}^{100}$.


These  experiments  suggest  that  the  effect  of  the  backtracking  parameters  on  the 
convergence  is  not  large,  no  more  than  a  factor  of  two  or  so. 

Gradient  method  and  condition  number 

Our last experiment will illustrate the importance of the condition number of
$\nabla^2 f(x)$ (or the sublevel sets) on the rate of convergence of the gradient method.
We start with the function given by (9.21), but replace the variable $x$ by $x = T\bar x$,
where

\[ T = \operatorname{diag}((1, \gamma^{1/n}, \gamma^{2/n}, \ldots, \gamma^{(n-1)/n})), \]

i.e., we minimize

\[ \bar f(\bar x) = c^T T\bar x - \sum_{i=1}^m \log(b_i - a_i^T T\bar x). \tag{9.22} \]

This gives us a family of optimization problems, indexed by $\gamma$, which affects the
problem condition number.

Figure 9.7 shows the number of iterations required to achieve $\bar f(\bar x^{(k)}) - \bar p^\star < 10^{-5}$
as a function of $\gamma$, using a backtracking line search with $\alpha = 0.3$ and $\beta = 0.7$. This
plot shows that for diagonal scaling as small as 10 : 1 (i.e., $\gamma = 10$), the number of
iterations grows to more than a thousand; for a diagonal scaling of 20 or more, the
gradient method slows to essentially useless.

The condition number of the Hessian $\nabla^2 \bar f(\bar x^\star)$ at the optimum is shown in
figure 9.8. For large and small $\gamma$, the condition number increases roughly as
$\max\{\gamma^2, 1/\gamma^2\}$, in a very similar way as the number of iterations depends on $\gamma$.
This shows again that the relation between conditioning and convergence speed is
a real phenomenon, and not just an artifact of our analysis.




Figure 9.7 Number of iterations of the gradient method applied to problem (9.22).
The vertical axis shows the number of iterations required to
obtain $\bar f(\bar x^{(k)}) - \bar p^\star < 10^{-5}$. The horizontal axis shows $\gamma$, which is a parameter
that controls the amount of diagonal scaling. We use a backtracking
line search with $\alpha = 0.3$, $\beta = 0.7$.


Figure 9.8 Condition number of the Hessian of the function at its minimum,
as a function of $\gamma$. By comparing this plot with the one in figure 9.7, we see
that the condition number has a very strong influence on convergence rate.

Conclusions

From the numerical examples shown, and others, we can make the conclusions
summarized below.

• The gradient method often exhibits approximately linear convergence, i.e.,
the error $f(x^{(k)}) - p^\star$ converges to zero approximately as a geometric series.

• The choice of backtracking parameters $\alpha$, $\beta$ has a noticeable but not dramatic
effect on the convergence. An exact line search sometimes improves the
convergence of the gradient method, but the effect is not large (and probably
not worth the trouble of implementing the exact line search).

• The convergence rate depends greatly on the condition number of the Hessian,
or the sublevel sets. Convergence can be very slow, even for problems that are
moderately well conditioned (say, with condition number in the 100s). When
the condition number is larger (say, 1000 or more) the gradient method is so
slow that it is useless in practice.

The main advantage of the gradient method is its simplicity. Its main disadvantage
is that its convergence rate depends so critically on the condition number of the
Hessian or sublevel sets.


9.4  Steepest  descent  method 

The first-order Taylor approximation of $f(x + v)$ around $x$ is

\[ f(x + v) \approx \hat f(x + v) = f(x) + \nabla f(x)^T v. \]

The second term on the righthand side, $\nabla f(x)^T v$, is the directional derivative of
$f$ at $x$ in the direction $v$. It gives the approximate change in $f$ for a small step $v$.
The step $v$ is a descent direction if the directional derivative is negative.

We now address the question of how to choose $v$ to make the directional derivative
as negative as possible. Since the directional derivative $\nabla f(x)^T v$ is linear in
$v$, it can be made as negative as we like by taking $v$ large (provided $v$ is a descent
direction, i.e., $\nabla f(x)^T v < 0$). To make the question sensible we have to limit the
size of $v$, or normalize by the length of $v$.

Let $\|\cdot\|$ be any norm on $\mathbf{R}^n$. We define a normalized steepest descent direction
(with respect to the norm $\|\cdot\|$) as

\[ \Delta x_{\mathrm{nsd}} = \operatorname{argmin}\{\nabla f(x)^T v \mid \|v\| = 1\}. \tag{9.23} \]

(We say 'a' steepest descent direction because there can be multiple minimizers.)
A normalized steepest descent direction $\Delta x_{\mathrm{nsd}}$ is a step of unit norm that gives the
largest decrease in the linear approximation of $f$.

A normalized steepest descent direction can be interpreted geometrically as
follows. We can just as well define $\Delta x_{\mathrm{nsd}}$ as

\[ \Delta x_{\mathrm{nsd}} = \operatorname{argmin}\{\nabla f(x)^T v \mid \|v\| \le 1\}, \]
i.e., as the direction in the unit ball of $\|\cdot\|$ that extends farthest in the direction
$-\nabla f(x)$.

It is also convenient to consider a steepest descent step $\Delta x_{\mathrm{sd}}$ that is unnormalized,
by scaling the normalized steepest descent direction in a particular way:

\[ \Delta x_{\mathrm{sd}} = \|\nabla f(x)\|_* \Delta x_{\mathrm{nsd}}, \tag{9.24} \]

where $\|\cdot\|_*$ denotes the dual norm. Note that for the steepest descent step, we
have

\[ \nabla f(x)^T \Delta x_{\mathrm{sd}} = \|\nabla f(x)\|_* \nabla f(x)^T \Delta x_{\mathrm{nsd}} = -\|\nabla f(x)\|_*^2 \]

(see exercise 9.7).

The steepest descent method uses the steepest descent direction as search direction.


Algorithm 9.4 Steepest descent method.

given a starting point $x \in \operatorname{dom} f$.

repeat

1. Compute steepest descent direction $\Delta x_{\mathrm{sd}}$.

2. Line search. Choose $t$ via backtracking or exact line search.

3. Update. $x := x + t\Delta x_{\mathrm{sd}}$.
until stopping criterion is satisfied.


When exact line search is used, scale factors in the descent direction have no effect,
so the normalized or unnormalized direction can be used.


9.4.1  Steepest  descent  for  Euclidean  and  quadratic  norms 

Steepest  descent  for  Euclidean  norm 

If we take the norm $\|\cdot\|$ to be the Euclidean norm we find that the steepest descent
direction is simply the negative gradient, i.e., $\Delta x_{\mathrm{sd}} = -\nabla f(x)$. The steepest
descent method for the Euclidean norm coincides with the gradient descent method.

Steepest descent for quadratic norm

We consider the quadratic norm

\[ \|z\|_P = (z^T P z)^{1/2} = \|P^{1/2} z\|_2, \]

where $P \in \mathbf{S}^n_{++}$. The normalized steepest descent direction is given by

\[ \Delta x_{\mathrm{nsd}} = -\left(\nabla f(x)^T P^{-1} \nabla f(x)\right)^{-1/2} P^{-1} \nabla f(x). \]

The dual norm is given by $\|z\|_* = \|P^{-1/2} z\|_2$, so the steepest descent step with
respect to $\|\cdot\|_P$ is given by

\[ \Delta x_{\mathrm{sd}} = -P^{-1} \nabla f(x). \tag{9.25} \]

The normalized steepest descent direction for a quadratic norm is illustrated in
figure 9.9.

Figure 9.9 Normalized steepest descent direction for a quadratic norm. The
ellipsoid shown is the unit ball of the norm, translated to the point $x$. The
normalized steepest descent direction $\Delta x_{\mathrm{nsd}}$ at $x$ extends as far as possible
in the direction $-\nabla f(x)$ while staying in the ellipsoid. The gradient and
normalized steepest descent directions are shown.
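The formulas for the quadratic norm are easy to check numerically. The sketch below restricts $P$ to a diagonal matrix, purely to keep the example free of linear algebra libraries; that restriction, the function name, and the sample gradient are our assumptions. It verifies that $\|\Delta x_{\mathrm{nsd}}\|_P = 1$ and that $\nabla f(x)^T \Delta x_{\mathrm{sd}} = -\|\nabla f(x)\|_*^2$.

```python
import math

def quadratic_norm_steps(grad, p_diag):
    """Steepest descent directions for ||z||_P with P = diag(p_diag), p_diag > 0."""
    pinv_g = [g / p for g, p in zip(grad, p_diag)]                   # P^{-1} grad
    dual_norm = math.sqrt(sum(g * q for g, q in zip(grad, pinv_g)))  # ||P^{-1/2} grad||_2
    dx_nsd = [-q / dual_norm for q in pinv_g]                        # normalized direction
    dx_sd = [-q for q in pinv_g]                                     # unnormalized step (9.25)
    return dx_nsd, dx_sd, dual_norm

grad = [3.0, -4.0]          # sample gradient value (illustrative)
p_diag = [2.0, 8.0]         # P = diag(2, 8)
dx_nsd, dx_sd, dual_norm = quadratic_norm_steps(grad, p_diag)

p_norm_nsd = math.sqrt(sum(p * d * d for p, d in zip(p_diag, dx_nsd)))
inner = sum(g * d for g, d in zip(grad, dx_sd))
print(p_norm_nsd, inner, -dual_norm ** 2)
```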


Interpretation  via  change  of  coordinates 

We can give an interesting alternative interpretation of the steepest descent direction
$\Delta x_{\mathrm{sd}}$ as the gradient search direction after a change of coordinates is applied
to the problem. Define $\bar u = P^{1/2} u$, so we have $\|u\|_P = \|\bar u\|_2$. Using this change
of coordinates, we can solve the original problem of minimizing $f$ by solving the
equivalent problem of minimizing the function $\bar f : \mathbf{R}^n \to \mathbf{R}$, given by

\[ \bar f(\bar u) = f(P^{-1/2}\bar u) = f(u). \]

If we apply the gradient method to $\bar f$, the search direction at a point $\bar x$ (which
corresponds to the point $x = P^{-1/2}\bar x$ for the original problem) is

\[ \Delta \bar x = -\nabla \bar f(\bar x) = -P^{-1/2}\nabla f(P^{-1/2}\bar x) = -P^{-1/2}\nabla f(x). \]

This gradient search direction corresponds to the direction

\[ \Delta x = P^{-1/2}\left(-P^{-1/2}\nabla f(x)\right) = -P^{-1}\nabla f(x) \]

for the original variable $x$. In other words, the steepest descent method in the
quadratic norm $\|\cdot\|_P$ can be thought of as the gradient method applied to the
problem after the change of coordinates $\bar x = P^{1/2} x$.
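The equivalence can be confirmed on a small example. In the sketch below we again assume a diagonal $P$ and use the quadratic $f(x) = (1/2)(x_1^2 + 10x_2^2)$ as a stand-in objective; both choices, and the function names, are ours. One function applies (9.25) directly; the other takes a gradient step in the coordinates $\bar x = P^{1/2}x$ and maps it back.

```python
import math

def grad_f(x):
    """Gradient of the stand-in objective f(x) = (x1^2 + 10*x2^2)/2."""
    return [x[0], 10.0 * x[1]]

def step_direct(x, p_diag):
    """Steepest descent step -P^{-1} grad f(x) for P = diag(p_diag)."""
    return [-g / p for g, p in zip(grad_f(x), p_diag)]

def step_via_coordinates(x, p_diag):
    """Gradient step for f_bar(x_bar) = f(P^{-1/2} x_bar), mapped back to x."""
    s = [math.sqrt(p) for p in p_diag]                       # diagonal of P^{1/2}
    x_bar = [si * xi for si, xi in zip(s, x)]                # x_bar = P^{1/2} x
    x_back = [xb / si for xb, si in zip(x_bar, s)]           # P^{-1/2} x_bar
    g_bar = [g / si for g, si in zip(grad_f(x_back), s)]     # P^{-1/2} grad f
    dx_bar = [-g for g in g_bar]                             # gradient direction in x_bar
    return [d / si for d, si in zip(dx_bar, s)]              # dx = P^{-1/2} dx_bar

x = [3.0, -2.0]
p_diag = [2.0, 8.0]
d1 = step_direct(x, p_diag)
d2 = step_via_coordinates(x, p_diag)
print(d1, d2)
```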


9.4.2 Steepest descent for $\ell_1$-norm

As another example, we consider the steepest descent method for the $\ell_1$-norm. A
normalized steepest descent direction,

\[ \Delta x_{\mathrm{nsd}} = \operatorname{argmin}\{\nabla f(x)^T v \mid \|v\|_1 \le 1\}, \]

is easily characterized. Let $i$ be any index for which $\|\nabla f(x)\|_\infty = |(\nabla f(x))_i|$. Then
a normalized steepest descent direction $\Delta x_{\mathrm{nsd}}$ for the $\ell_1$-norm is given by

\[ \Delta x_{\mathrm{nsd}} = -\operatorname{sign}\left(\frac{\partial f(x)}{\partial x_i}\right) e_i, \]

where $e_i$ is the $i$th standard basis vector. An unnormalized steepest descent step
is then

\[ \Delta x_{\mathrm{sd}} = \Delta x_{\mathrm{nsd}} \|\nabla f(x)\|_\infty = -\frac{\partial f(x)}{\partial x_i} e_i. \]

Thus, the normalized steepest descent step in $\ell_1$-norm can always be chosen to be a
standard basis vector (or a negative standard basis vector). It is the coordinate axis
direction along which the approximate decrease in $f$ is greatest. This is illustrated
in figure 9.10.

Figure 9.10 Normalized steepest descent direction for the $\ell_1$-norm. The
diamond is the unit ball of the $\ell_1$-norm, translated to the point $x$. The
normalized steepest descent direction can always be chosen in the direction
of a standard basis vector; in this example we have $\Delta x_{\mathrm{nsd}} = e_1$.

The steepest descent algorithm in the $\ell_1$-norm has a very natural interpretation:
At each iteration we select a component of $\nabla f(x)$ with maximum absolute value,
and then decrease or increase the corresponding component of $x$, according to the
sign of $(\nabla f(x))_i$. The algorithm is sometimes called a coordinate-descent algorithm,
since only one component of the variable $x$ is updated at each iteration. This can
greatly simplify, or even trivialize, the line search.
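Selecting the $\ell_1$ steepest descent direction amounts to an argmax over the gradient components. A small sketch (the function name and the sample gradient are ours; ties are broken by taking the first maximizing index, which is one valid choice among the minimizers):

```python
def l1_steepest_descent(grad):
    """Return (i, dx_nsd, dx_sd) for the l1-norm, given a nonzero gradient."""
    n = len(grad)
    # Index of a component with maximum absolute value (first one on ties).
    i = max(range(n), key=lambda j: abs(grad[j]))
    sign = 1.0 if grad[i] > 0 else -1.0
    dx_nsd = [0.0] * n
    dx_nsd[i] = -sign            # -sign(df/dx_i) * e_i
    dx_sd = [0.0] * n
    dx_sd[i] = -grad[i]          # dx_nsd scaled by ||grad||_inf
    return i, dx_nsd, dx_sd

i, dx_nsd, dx_sd = l1_steepest_descent([0.5, -2.0, 1.0])
print(i, dx_nsd, dx_sd)
```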


Example 9.2 Frobenius norm scaling. In §4.5.4 we encountered the unconstrained
geometric program

\[ \text{minimize} \quad \sum_{i,j=1}^n M_{ij}^2 d_i^2 / d_j^2, \]

where $M \in \mathbf{R}^{n \times n}$ is given, and the variable is $d \in \mathbf{R}^n$. Using the change of variables
$x_i = 2\log d_i$ we can express this geometric program in convex form as

\[ \text{minimize} \quad f(x) = \log\left(\sum_{i,j=1}^n M_{ij}^2 e^{x_i - x_j}\right). \]

It is easy to minimize $f$ one component at a time. Keeping all components except
the $k$th fixed, we can write $f(x) = \log(\alpha_k + \beta_k e^{-x_k} + \gamma_k e^{x_k})$, where

\[ \alpha_k = M_{kk}^2 + \sum_{i,j \ne k} M_{ij}^2 e^{x_i - x_j}, \qquad \beta_k = \sum_{i \ne k} M_{ik}^2 e^{x_i}, \qquad \gamma_k = \sum_{i \ne k} M_{ki}^2 e^{-x_i}. \]

The minimum of $f(x)$, as a function of $x_k$, is obtained for $x_k = \log(\beta_k/\gamma_k)/2$. So
for this problem an exact line search can be carried out using a simple analytical
formula.

The $\ell_1$-steepest descent algorithm with exact line search consists of repeating the
following steps.

1. Compute the gradient

\[ (\nabla f(x))_i = \frac{-\beta_i e^{-x_i} + \gamma_i e^{x_i}}{\alpha_i + \beta_i e^{-x_i} + \gamma_i e^{x_i}}, \quad i = 1, \ldots, n. \]

2. Select a largest (in absolute value) component of $\nabla f(x)$: $|\nabla f(x)|_k = \|\nabla f(x)\|_\infty$.

3. Minimize $f$ over the scalar variable $x_k$, by setting $x_k = \log(\beta_k/\gamma_k)/2$.
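The three steps above can be sketched as follows. The function names and the test matrix are our own, and the code assumes every $\beta_k$ and $\gamma_k$ is positive (for instance, all entries of $M$ nonzero) so that $\log(\beta_k/\gamma_k)$ is defined. For the matrix used here each pair satisfies $|M_{ij}M_{ji}| = 1$ and $M_{ii} = 1$, so by the arithmetic-geometric mean inequality the optimal value is $\log 9$.

```python
import math

def frobenius_objective(M, x):
    """f(x) = log(sum_{i,j} M_ij^2 exp(x_i - x_j))."""
    n = len(x)
    return math.log(sum(M[i][j] ** 2 * math.exp(x[i] - x[j])
                        for i in range(n) for j in range(n)))

def coordinate_descent_step(M, x):
    """One l1-steepest descent iteration with exact line search."""
    n = len(x)
    total = math.exp(frobenius_objective(M, x))  # alpha_k + beta_k e^{-x_k} + gamma_k e^{x_k}
    best_k, best_abs_grad, best_ratio = 0, -1.0, 1.0
    for k in range(n):
        beta_k = sum(M[i][k] ** 2 * math.exp(x[i]) for i in range(n) if i != k)
        gamma_k = sum(M[k][i] ** 2 * math.exp(-x[i]) for i in range(n) if i != k)
        grad_k = (-beta_k * math.exp(-x[k]) + gamma_k * math.exp(x[k])) / total
        if abs(grad_k) > best_abs_grad:          # step 2: largest |gradient| component
            best_k, best_abs_grad, best_ratio = k, abs(grad_k), beta_k / gamma_k
    x = list(x)
    x[best_k] = 0.5 * math.log(best_ratio)       # step 3: exact minimization over x_k
    return x

# Illustrative badly scaled matrix; its optimal value is log(9).
M = [[1.0, 10.0, 100.0],
     [0.1, 1.0, 10.0],
     [0.01, 0.1, 1.0]]
x = [0.0, 0.0, 0.0]
values = [frobenius_objective(M, x)]
for _ in range(100):
    x = coordinate_descent_step(M, x)
    values.append(frobenius_objective(M, x))
print(values[0], values[-1], math.log(9.0))
```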


9.4.3  Convergence  analysis 

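In this section we extend the convergence analysis for the gradient method with
backtracking line search to the steepest descent method for an arbitrary norm. We
will use the fact that any norm can be bounded in terms of the Euclidean norm,
so there exist constants $\gamma, \tilde\gamma \in (0, 1]$ such that

\[ \|x\| \ge \gamma\|x\|_2, \qquad \|x\|_* \ge \tilde\gamma\|x\|_2 \]

(see §A.1.4).

Again we assume $f$ is strongly convex on the initial sublevel set $S$. The upper
bound $\nabla^2 f(x) \preceq MI$ implies an upper bound on the function $f(x + t\Delta x_{\mathrm{sd}})$ as a
function of $t$:

\[ \begin{aligned}
f(x + t\Delta x_{\mathrm{sd}}) &\le f(x) + t\nabla f(x)^T \Delta x_{\mathrm{sd}} + \frac{M\|\Delta x_{\mathrm{sd}}\|_2^2}{2} t^2 \\
&\le f(x) + t\nabla f(x)^T \Delta x_{\mathrm{sd}} + \frac{M\|\Delta x_{\mathrm{sd}}\|^2}{2\gamma^2} t^2 \\
&= f(x) - t\|\nabla f(x)\|_*^2 + \frac{M}{2\gamma^2} t^2 \|\nabla f(x)\|_*^2.
\end{aligned} \tag{9.26} \]

The step size $\hat t = \gamma^2/M$ (which minimizes the quadratic upper bound (9.26))
satisfies the exit condition for the backtracking line search:

\[ f(x + \hat t\Delta x_{\mathrm{sd}}) \le f(x) - \frac{\gamma^2}{2M}\|\nabla f(x)\|_*^2 \le f(x) + \frac{\alpha\gamma^2}{M}\nabla f(x)^T \Delta x_{\mathrm{sd}} \tag{9.27} \]

since $\alpha < 1/2$ and $\nabla f(x)^T \Delta x_{\mathrm{sd}} = -\|\nabla f(x)\|_*^2$. The line search therefore returns a
step size $t \ge \min\{1, \beta\gamma^2/M\}$, and we have

\[ \begin{aligned}
f(x^+) = f(x + t\Delta x_{\mathrm{sd}}) &\le f(x) - \alpha\min\{1, \beta\gamma^2/M\}\|\nabla f(x)\|_*^2 \\
&\le f(x) - \alpha\tilde\gamma^2\min\{1, \beta\gamma^2/M\}\|\nabla f(x)\|_2^2.
\end{aligned} \]

Subtracting $p^\star$ from both sides and using (9.9), we obtain

\[ f(x^+) - p^\star \le c(f(x) - p^\star), \]

where

\[ c = 1 - 2m\alpha\tilde\gamma^2\min\{1, \beta\gamma^2/M\} < 1. \]

Therefore we have

\[ f(x^{(k)}) - p^\star \le c^k (f(x^{(0)}) - p^\star), \]

i.e., linear convergence exactly as in the gradient method.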


9.4.4  Discussion  and  examples 

Choice  of  norm  for  steepest  descent 

The choice of norm used to define the steepest descent direction can have a
dramatic effect on the convergence rate. For simplicity, we consider the case of
steepest descent with quadratic $P$-norm. In §9.4.1, we showed that the steepest descent
method with quadratic $P$-norm is the same as the gradient method applied to the
problem after the change of coordinates $\bar x = P^{1/2}x$. We know that the gradient
method works well when the condition numbers of the sublevel sets (or the
Hessian near the optimal point) are moderate, and works poorly when the condition
numbers are large. It follows that when the sublevel sets, after the change of
coordinates $\bar x = P^{1/2}x$, are moderately conditioned, the steepest descent method will
work well.

This observation provides a prescription for choosing $P$: It should be chosen
so that the sublevel sets of $f$, transformed by $P^{-1/2}$, are well conditioned. For
example if an approximation $\hat H$ of the Hessian at the optimal point $H(x^\star)$ were
known, a very good choice of $P$ would be $P = \hat H$, since the Hessian of the
transformed objective $\bar f$ at the optimum is then

\[ P^{-1/2}\nabla^2 f(x^\star)P^{-1/2} \approx I \]

and so is likely to have a low condition number.

This same idea can be described without a change of coordinates. Saying that
a sublevel set has low condition number after the change of coordinates $\bar x = P^{1/2}x$
is the same as saying that the ellipsoid

\[ \mathcal{E} = \{x \mid x^T P x \le 1\} \]

approximates the shape of the sublevel set. (In other words, it gives a good
approximation after appropriate scaling and translation.)

This dependence of the convergence rate on the choice of $P$ can be viewed from
two sides. The optimist's viewpoint is that for any problem, there is always a
choice of $P$ for which the steepest descent method works very well. The challenge,
of course, is to find such a $P$. The pessimist's viewpoint is that for any problem,
there are a huge number of choices of $P$ for which steepest descent works very
poorly. In summary, we can say that the steepest descent method works well in
cases where we can identify a matrix $P$ for which the transformed problem has
moderate condition number.

Figure 9.11 Steepest descent method with a quadratic norm $\|\cdot\|_{P_1}$. The
ellipses are the boundaries of the norm balls $\{x \mid \|x - x^{(k)}\|_{P_1} \le 1\}$ at
$x^{(0)}$ and $x^{(1)}$.


Examples 

In this section we illustrate some of these ideas using the nonquadratic problem in $\mathbf{R}^2$ with objective function (9.20). We apply the steepest descent method to the problem, using the two quadratic norms defined by

$$P_1 = \begin{bmatrix} 2 & 0 \\ 0 & 8 \end{bmatrix}, \qquad P_2 = \begin{bmatrix} 8 & 0 \\ 0 & 2 \end{bmatrix}.$$

In both cases we use a backtracking line search with $\alpha = 0.1$ and $\beta = 0.7$.

Figures 9.11 and 9.12 show the iterates for steepest descent with norm $\|\cdot\|_{P_1}$ and norm $\|\cdot\|_{P_2}$. Figure 9.13 shows the error versus iteration number for both norms. Figure 9.13 shows that the choice of norm strongly influences the convergence. With the norm $\|\cdot\|_{P_1}$, convergence is a bit more rapid than the gradient method, whereas with the norm $\|\cdot\|_{P_2}$, convergence is far slower.
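The experiment just described can be imitated with a short script. This is only a sketch under stated assumptions: the objective below is an illustrative sum of three exponentials (it is not claimed to be the objective (9.20)), the starting point is hypothetical, and the stopping rule (a small Euclidean gradient norm) is chosen for convenience. The search direction $-P^{-1}\nabla f(x)$ is the steepest descent direction for the quadratic norm $\|\cdot\|_P$, and the line search uses $\alpha = 0.1$, $\beta = 0.7$ as in the text.

```python
import numpy as np

def f(x):
    # Illustrative smooth convex objective on R^2 (an assumption for this sketch).
    return (np.exp(x[0] + 3 * x[1] - 0.1)
            + np.exp(x[0] - 3 * x[1] - 0.1)
            + np.exp(-x[0] - 0.1))

def grad(x):
    e1 = np.exp(x[0] + 3 * x[1] - 0.1)
    e2 = np.exp(x[0] - 3 * x[1] - 0.1)
    e3 = np.exp(-x[0] - 0.1)
    return np.array([e1 + e2 - e3, 3 * e1 - 3 * e2])

def steepest_descent(P, x0, alpha=0.1, beta=0.7, tol=1e-6, max_iter=5000):
    """Steepest descent for the quadratic norm ||z||_P = (z^T P z)^{1/2}."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:       # convenience stopping rule (assumption)
            return x, k
        dx = -np.linalg.solve(P, g)        # direction -P^{-1} grad f(x)
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta                      # backtracking line search
        x = x + t * dx
    return x, max_iter

P1 = np.diag([2.0, 8.0])
P2 = np.diag([8.0, 2.0])
x0 = np.array([-1.0, 1.0])                 # hypothetical starting point

x1, iters1 = steepest_descent(P1, x0)
x2, iters2 = steepest_descent(P2, x0)
print("iterations with P1:", iters1)
print("iterations with P2:", iters2)
```

For this illustrative objective the Hessian at the minimizer is roughly diag(2.6, 11.5), so $P_1$ is close to a multiple of it while $P_2$ scales the two coordinates the wrong way round.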

This can be explained by examining the problems after the changes of coordinates $\bar{x} = P_1^{1/2}x$ and $\bar{x} = P_2^{1/2}x$, respectively. Figures 9.14 and 9.15 show the problems in the transformed coordinates. The change of variables associated with $P_1$ yields sublevel sets with modest condition number, so convergence is fast. The change of variables associated with $P_2$ yields sublevel sets that are more poorly conditioned, which explains the slower convergence.

Figure 9.13 Error $f(x^{(k)}) - p^\star$ versus iteration $k$, for the steepest descent method with the quadratic norm $\|\cdot\|_{P_1}$ and the quadratic norm $\|\cdot\|_{P_2}$. Convergence is rapid for the norm $\|\cdot\|_{P_1}$ and very slow for $\|\cdot\|_{P_2}$.

Figure 9.14 The iterates of steepest descent with norm $\|\cdot\|_{P_1}$, after the change of coordinates. This change of coordinates reduces the condition number of the sublevel sets, and so speeds up convergence.

Figure 9.15 The iterates of steepest descent with norm $\|\cdot\|_{P_2}$, after the change of coordinates. This change of coordinates increases the condition number of the sublevel sets, and so slows down convergence.

Figure 9.16 The function $f$ (shown solid) and its second-order approximation $\hat{f}$ at $x$ (dashed). The Newton step $\Delta x_{\mathrm{nt}}$ is what must be added to $x$ to give the minimizer of $\hat{f}$.


9.5  Newton’s  method 

9.5.1  The  Newton  step 

For $x \in \operatorname{dom} f$, the vector

$$\Delta x_{\mathrm{nt}} = -\nabla^2 f(x)^{-1}\nabla f(x)$$

is called the Newton step (for $f$, at $x$). Positive definiteness of $\nabla^2 f(x)$ implies that

$$\nabla f(x)^T \Delta x_{\mathrm{nt}} = -\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) < 0$$

unless $\nabla f(x) = 0$, so the Newton step is a descent direction (unless $x$ is optimal). The Newton step can be interpreted and motivated in several ways.

Minimizer  of  second-order  approximation 

The second-order Taylor approximation (or model) $\hat{f}$ of $f$ at $x$ is

$$\hat{f}(x + v) = f(x) + \nabla f(x)^T v + \tfrac{1}{2} v^T \nabla^2 f(x) v, \tag{9.28}$$

which is a convex quadratic function of $v$, and is minimized when $v = \Delta x_{\mathrm{nt}}$. Thus, the Newton step $\Delta x_{\mathrm{nt}}$ is what should be added to the point $x$ to minimize the second-order approximation of $f$ at $x$. This is illustrated in figure 9.16.
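As a small numerical illustration of this interpretation, the sketch below computes the Newton step for a convex quadratic, where the second-order model coincides with the function itself, so that $x + \Delta x_{\mathrm{nt}}$ should be the exact minimizer. The matrix `A`, the vector `b`, and the starting point are arbitrary illustrative choices, not data from the text.

```python
import numpy as np

def newton_step(g, H):
    # Solve H dx = -g instead of forming the inverse Hessian explicitly.
    return -np.linalg.solve(H, g)

# Convex quadratic f(x) = 0.5 x^T A x - b^T x (A positive definite; illustrative data).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.array([5.0, -4.0])          # arbitrary current point
g = A @ x - b                      # gradient of f at x
H = A                              # Hessian of f (constant for a quadratic)

x_new = x + newton_step(g, H)
x_star = np.linalg.solve(A, b)     # exact minimizer, from A x = b

print(np.allclose(x_new, x_star))            # one Newton step lands on x_star
print(np.allclose(A @ x_new - b, 0.0))       # gradient vanishes there
```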

This interpretation gives us some insight into the Newton step. If the function $f$ is quadratic, then $x + \Delta x_{\mathrm{nt}}$ is the exact minimizer of $f$. If the function $f$ is nearly quadratic, intuition suggests that $x + \Delta x_{\mathrm{nt}}$ should be a very good estimate of the minimizer of $f$, i.e., $x^\star$. Since $f$ is twice differentiable, the quadratic model of $f$ will be very accurate when $x$ is near $x^\star$. It follows that when $x$ is near $x^\star$, the point $x + \Delta x_{\mathrm{nt}}$ should be a very good estimate of $x^\star$. We will see that this intuition is correct.

Figure 9.17 The dashed lines are level curves of a convex function. The ellipsoid shown (with solid line) is $\{x + v \mid v^T \nabla^2 f(x) v \le 1\}$. The arrow shows $-\nabla f(x)$, the gradient descent direction. The Newton step $\Delta x_{\mathrm{nt}}$ is the steepest descent direction in the norm $\|\cdot\|_{\nabla^2 f(x)}$. The figure also shows $\Delta x_{\mathrm{nsd}}$, the normalized steepest descent direction for the same norm.


Steepest  descent  direction  in  Hessian  norm 

The Newton step is also the steepest descent direction at $x$, for the quadratic norm defined by the Hessian $\nabla^2 f(x)$, i.e.,

$$\|u\|_{\nabla^2 f(x)} = (u^T \nabla^2 f(x) u)^{1/2}.$$

This gives another insight into why the Newton step should be a good search direction, and a very good search direction when $x$ is near $x^\star$.

Recall from our discussion above that steepest descent, with quadratic norm $\|\cdot\|_P$, converges very rapidly when the Hessian, after the associated change of coordinates, has small condition number. In particular, near $x^\star$, a very good choice is $P = \nabla^2 f(x^\star)$. When $x$ is near $x^\star$, we have $\nabla^2 f(x) \approx \nabla^2 f(x^\star)$, which explains why the Newton step is a very good choice of search direction. This is illustrated in figure 9.17.

Solution  of  linearized  optimality  condition 

If we linearize the optimality condition $\nabla f(x^\star) = 0$ near $x$ we obtain

$$\nabla f(x + v) \approx \nabla f(x) + \nabla^2 f(x) v = 0,$$

which is a linear equation in $v$, with solution $v = \Delta x_{\mathrm{nt}}$. So the Newton step $\Delta x_{\mathrm{nt}}$ is what must be added to $x$ so that the linearized optimality condition holds. Again, this suggests that when $x$ is near $x^\star$ (so the optimality conditions almost hold), the update $x + \Delta x_{\mathrm{nt}}$ should be a very good approximation of $x^\star$.

When $n = 1$, i.e., $f : \mathbf{R} \to \mathbf{R}$, this interpretation is particularly simple. The solution $x^\star$ of the minimization problem is characterized by $f'(x^\star) = 0$, i.e., it is the zero-crossing of the derivative $f'$, which is monotonically increasing since $f$ is convex. Given our current approximation $x$ of the solution, we form a first-order Taylor approximation of $f'$ at $x$. The zero-crossing of this affine approximation is then $x + \Delta x_{\mathrm{nt}}$. This interpretation is illustrated in figure 9.18.

Figure 9.18 The solid curve is the derivative $f'$ of the function $f$ shown in figure 9.16. $\hat{f}'$ is the linear approximation of $f'$ at $x$. The Newton step $\Delta x_{\mathrm{nt}}$ is the difference between the root of $\hat{f}'$ and the point $x$.

Affine  invariance  of  the  Newton  step 

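An important feature of the Newton step is that it is independent of linear (or affine) changes of coordinates. Suppose $T \in \mathbf{R}^{n \times n}$ is nonsingular, and define $\bar{f}(y) = f(Ty)$. Then we have

$$\nabla \bar{f}(y) = T^T \nabla f(x), \qquad \nabla^2 \bar{f}(y) = T^T \nabla^2 f(x) T,$$

where $x = Ty$. The Newton step for $\bar{f}$ at $y$ is therefore

$$\begin{aligned}
\Delta y_{\mathrm{nt}} &= -\left(T^T \nabla^2 f(x) T\right)^{-1} \left(T^T \nabla f(x)\right) \\
&= -T^{-1} \nabla^2 f(x)^{-1} \nabla f(x) \\
&= T^{-1} \Delta x_{\mathrm{nt}},
\end{aligned}$$

where $\Delta x_{\mathrm{nt}}$ is the Newton step for $f$ at $x$. Hence the Newton steps of $f$ and $\bar{f}$ are related by the same linear transformation, and

$$x + \Delta x_{\mathrm{nt}} = T(y + \Delta y_{\mathrm{nt}}).$$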
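The relation $\Delta y_{\mathrm{nt}} = T^{-1}\Delta x_{\mathrm{nt}}$ can be checked numerically. The following is a minimal sketch; the test function (a sum of exponentials of affine functions), the matrix `T`, and the point `y` are illustrative assumptions rather than data from the text.

```python
import numpy as np

# Illustrative convex function f(x) = sum_i exp(a_i^T x + b_i).
A = np.array([[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]])
b = np.array([-0.1, -0.1, -0.1])

def grad_f(x):
    return A.T @ np.exp(A @ x + b)

def hess_f(x):
    return A.T @ (np.exp(A @ x + b)[:, None] * A)

T = np.array([[2.0, 1.0], [0.5, 3.0]])     # nonsingular change of coordinates
y = np.array([0.3, -0.2])
x = T @ y

# Newton step for f at x.
dx = -np.linalg.solve(hess_f(x), grad_f(x))

# Newton step for fbar(y) = f(T y) at y, using the chain-rule formulas above.
grad_fbar = T.T @ grad_f(x)
hess_fbar = T.T @ hess_f(x) @ T
dy = -np.linalg.solve(hess_fbar, grad_fbar)

print(np.allclose(dx, T @ dy))               # dy = T^{-1} dx
print(np.allclose(x + dx, T @ (y + dy)))     # updated points correspond
```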


The  Newton  decrement 

The quantity

$$\lambda(x) = \left(\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\right)^{1/2}$$

is called the Newton decrement at $x$. We will see that the Newton decrement plays an important role in the analysis of Newton’s method, and is also useful as a stopping criterion. We can relate the Newton decrement to the quantity $f(x) - \inf_y \hat{f}(y)$, where $\hat{f}$ is the second-order approximation of $f$ at $x$:

$$f(x) - \inf_y \hat{f}(y) = f(x) - \hat{f}(x + \Delta x_{\mathrm{nt}}) = \tfrac{1}{2}\lambda(x)^2.$$

Thus, $\lambda^2/2$ is an estimate of $f(x) - p^\star$, based on the quadratic approximation of $f$ at $x$.

We can also express the Newton decrement as

$$\lambda(x) = \left(\Delta x_{\mathrm{nt}}^T \nabla^2 f(x) \Delta x_{\mathrm{nt}}\right)^{1/2}. \tag{9.29}$$

This shows that $\lambda$ is the norm of the Newton step, in the quadratic norm defined by the Hessian, i.e., the norm

$$\|u\|_{\nabla^2 f(x)} = \left(u^T \nabla^2 f(x) u\right)^{1/2}.$$

The Newton decrement comes up in backtracking line search as well, since we have

$$\nabla f(x)^T \Delta x_{\mathrm{nt}} = -\lambda(x)^2. \tag{9.30}$$

This is the constant used in a backtracking line search, and can be interpreted as the directional derivative of $f$ at $x$ in the direction of the Newton step:

$$-\lambda(x)^2 = \nabla f(x)^T \Delta x_{\mathrm{nt}} = \left.\frac{d}{dt} f(x + \Delta x_{\mathrm{nt}} t)\right|_{t=0}.$$

Finally, we note that the Newton decrement is, like the Newton step, affine invariant. In other words, the Newton decrement of $\bar{f}(y) = f(Ty)$ at $y$, where $T$ is nonsingular, is the same as the Newton decrement of $f$ at $x = Ty$.
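The identities relating the Newton decrement to the Newton step can be verified numerically. The sketch below uses an illustrative sum-of-exponentials function and an arbitrary point (both assumptions), and approximates the directional derivative with a central finite difference.

```python
import numpy as np

# Illustrative convex function f(x) = sum_i exp(a_i^T x + b_i).
A = np.array([[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]])
b = np.array([-0.1, -0.1, -0.1])

def f(x):
    return np.sum(np.exp(A @ x + b))

def grad(x):
    return A.T @ np.exp(A @ x + b)

def hess(x):
    return A.T @ (np.exp(A @ x + b)[:, None] * A)

x = np.array([0.5, 0.4])                     # arbitrary point
g, H = grad(x), hess(x)
dx = -np.linalg.solve(H, g)                  # Newton step
lam2 = g @ np.linalg.solve(H, g)             # lambda(x)^2

# Second-order model (9.28) evaluated at the Newton step.
model_at_step = f(x) + g @ dx + 0.5 * dx @ H @ dx

# Central finite difference of t -> f(x + t * dx) at t = 0.
h = 1e-6
dir_deriv = (f(x + h * dx) - f(x - h * dx)) / (2 * h)

print(np.isclose(f(x) - model_at_step, lam2 / 2))   # model decrease is lambda^2 / 2
print(np.isclose(dx @ H @ dx, lam2))                # (9.29)
print(np.isclose(g @ dx, -lam2))                    # (9.30)
print(np.isclose(dir_deriv, -lam2, rtol=1e-5))      # directional derivative
```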


9.5.2  Newton’s  method 

Newton’s  method,  as  outlined  below,  is  sometimes  called  the  damped  Newton 
method  or  guarded  Newton  method,  to  distinguish  it  from  the  pure  Newton  method, 
which  uses  a  fixed  step  size  t  =  1. 


Algorithm 9.5 Newton’s method.

given a starting point $x \in \operatorname{dom} f$, tolerance $\epsilon > 0$.

repeat

1. Compute the Newton step and decrement.

$\Delta x_{\mathrm{nt}} := -\nabla^2 f(x)^{-1} \nabla f(x)$; $\quad \lambda^2 := \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)$.

2. Stopping criterion. quit if $\lambda^2/2 \le \epsilon$.

3. Line search. Choose step size $t$ by backtracking line search.

4. Update. $x := x + t\Delta x_{\mathrm{nt}}$.
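The four steps above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not a reference one: the function name `newton`, its signature, and the default parameters are hypothetical, and the sketch assumes $\operatorname{dom} f = \mathbf{R}^n$ (for a restricted domain, `f` would need to return `np.inf` outside its domain so that the line search rejects such points).

```python
import numpy as np

def newton(f, grad, hess, x0, eps=1e-10, alpha=0.1, beta=0.7, max_iter=100):
    """Newton's method with backtracking line search (sketch of Algorithm 9.5)."""
    x = np.asarray(x0, dtype=float)
    history = []                              # lambda^2 at each iteration
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)           # 1. Newton step
        lam2 = -(g @ dx)                      #    lambda^2 = g^T H^{-1} g
        history.append(lam2)
        if lam2 / 2 <= eps:                   # 2. stopping criterion
            break
        t = 1.0                               # 3. backtracking line search
        while f(x + t * dx) > f(x) - alpha * t * lam2:
            t *= beta
        x = x + t * dx                        # 4. update
    return x, history

# Illustrative test problem (an assumption): f(x) = sum_i exp(a_i^T x + b_i).
A = np.array([[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]])
b = np.array([-0.1, -0.1, -0.1])
f = lambda x: np.sum(np.exp(A @ x + b))
grad = lambda x: A.T @ np.exp(A @ x + b)
hess = lambda x: A.T @ (np.exp(A @ x + b)[:, None] * A)

x_opt, history = newton(f, grad, hess, x0=[-1.0, 1.0])
print("minimizer estimate:", x_opt)
print("lambda^2 / 2 per iteration:", [h / 2 for h in history])
```

The line search condition uses $\nabla f(x)^T \Delta x_{\mathrm{nt}} = -\lambda^2$, so no extra inner product is needed.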


This is essentially the general descent method described in §9.2, using the Newton step as search direction. The only difference (which is very minor) is that the stopping criterion is checked after computing the search direction, rather than after the update.

9.5.3 Convergence analysis

We assume, as before, that $f$ is twice continuously differentiable, and strongly convex with constant $m$, i.e., $\nabla^2 f(x) \succeq mI$ for $x \in S$. We have seen that this also implies that there exists an $M > 0$ such that $\nabla^2 f(x) \preceq MI$ for all $x \in S$.

In addition, we assume that the Hessian of $f$ is Lipschitz continuous on $S$ with constant $L$, i.e.,

$$\|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L\|x - y\|_2 \tag{9.31}$$

for all $x, y \in S$. The coefficient $L$, which can be interpreted as a bound on the third derivative of $f$, can be taken to be zero for a quadratic function. More generally $L$ measures how well $f$ can be approximated by a quadratic model, so we can expect the Lipschitz constant $L$ to play a critical role in the performance of Newton’s method. Intuition suggests that Newton’s method will work very well for a function whose quadratic model varies slowly (i.e., has small $L$).

Idea  and  outline  of  convergence  proof 

We first give the idea and outline of the convergence proof, and the main conclusion, and then the details of the proof. We will show there are numbers $\eta$ and $\gamma$ with $0 < \eta \le m^2/L$ and $\gamma > 0$ such that the following hold.

• If $\|\nabla f(x^{(k)})\|_2 \ge \eta$, then

$$f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma. \tag{9.32}$$

• If $\|\nabla f(x^{(k)})\|_2 < \eta$, then the backtracking line search selects $t^{(k)} = 1$ and

$$\frac{L}{2m^2}\|\nabla f(x^{(k+1)})\|_2 \le \left(\frac{L}{2m^2}\|\nabla f(x^{(k)})\|_2\right)^2. \tag{9.33}$$

Let us analyze the implications of the second condition. Suppose that it is satisfied for iteration $k$, i.e., $\|\nabla f(x^{(k)})\|_2 < \eta$. Since $\eta \le m^2/L$, we have $\|\nabla f(x^{(k+1)})\|_2 < \eta$, i.e., the second condition is also satisfied at iteration $k+1$. Continuing recursively, we conclude that once the second condition holds, it will hold for all future iterates, i.e., for all $l \ge k$, we have $\|\nabla f(x^{(l)})\|_2 < \eta$. Therefore for all $l \ge k$, the algorithm takes a full Newton step $t = 1$, and

$$\frac{L}{2m^2}\|\nabla f(x^{(l+1)})\|_2 \le \left(\frac{L}{2m^2}\|\nabla f(x^{(l)})\|_2\right)^2. \tag{9.34}$$

Applying this inequality recursively, we find that for $l \ge k$,

$$\frac{L}{2m^2}\|\nabla f(x^{(l)})\|_2 \le \left(\frac{L}{2m^2}\|\nabla f(x^{(k)})\|_2\right)^{2^{l-k}} \le \left(\frac{1}{2}\right)^{2^{l-k}},$$

and hence

$$f(x^{(l)}) - p^\star \le \frac{1}{2m}\|\nabla f(x^{(l)})\|_2^2 \le \frac{2m^3}{L^2}\left(\frac{1}{2}\right)^{2^{l-k+1}}. \tag{9.35}$$

This last inequality shows that convergence is extremely rapid once the second condition is satisfied. This phenomenon is called quadratic convergence. Roughly speaking, the inequality (9.35) means that, after a sufficiently large number of iterations, the number of correct digits doubles at each iteration.

The iterations in Newton’s method naturally fall into two stages. The second stage, which occurs once the condition $\|\nabla f(x)\|_2 \le \eta$ holds, is called the quadratically convergent stage. We refer to the first stage as the damped Newton phase, because the algorithm can choose a step size $t < 1$. The quadratically convergent stage is also called the pure Newton phase, since in these iterations a step size $t = 1$ is always chosen.

Now we can estimate the total complexity. First we derive an upper bound on the number of iterations in the damped Newton phase. Since $f$ decreases by at least $\gamma$ at each iteration, the number of damped Newton steps cannot exceed

$$\frac{f(x^{(0)}) - p^\star}{\gamma},$$

since if it did, $f$ would be less than $p^\star$, which is impossible.

We can bound the number of iterations in the quadratically convergent phase using the inequality (9.35). It implies that we must have $f(x) - p^\star \le \epsilon$ after no more than

$$\log_2 \log_2(\epsilon_0/\epsilon)$$

iterations in the quadratically convergent phase, where $\epsilon_0 = 2m^3/L^2$.

Overall, then, the number of iterations until $f(x) - p^\star \le \epsilon$ is bounded above by

$$\frac{f(x^{(0)}) - p^\star}{\gamma} + \log_2 \log_2(\epsilon_0/\epsilon). \tag{9.36}$$

The term $\log_2 \log_2(\epsilon_0/\epsilon)$, which bounds the number of iterations in the quadratically convergent phase, grows extremely slowly with required accuracy $\epsilon$, and can be considered a constant for practical purposes, say five or six. (Six iterations of the quadratically convergent stage gives an accuracy of about $\epsilon \approx 5 \cdot 10^{-20}\epsilon_0$.)
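To get a feel for how slowly the term $\log_2\log_2(\epsilon_0/\epsilon)$ grows, the bound can be evaluated directly. The helper names below are hypothetical, and the numbers passed to `newton_iteration_bound` are made-up illustrative values, since the constants $m$, $L$, and $\gamma$ are rarely known in practice.

```python
import math

def quadratic_phase_bound(eps0, eps):
    """Bound log2(log2(eps0 / eps)) on iterations in the quadratically convergent phase."""
    return math.log2(math.log2(eps0 / eps))

def newton_iteration_bound(f0_minus_pstar, gamma, eps0, eps):
    """Bound (9.36): damped-phase term plus quadratically-convergent-phase term."""
    return f0_minus_pstar / gamma + quadratic_phase_bound(eps0, eps)

# A relative accuracy eps / eps0 = 2^-64 (about 5.4e-20) costs six iterations.
print(2.0 ** -64)
print(quadratic_phase_bound(1.0, 2.0 ** -64))

# Tightening the accuracy by many more orders of magnitude adds only a fraction.
print(quadratic_phase_bound(1.0, 1e-30))

# Hypothetical values f(x0) - p* = 10 and gamma = 0.5.
print(newton_iteration_bound(10.0, 0.5, 1.0, 2.0 ** -64))
```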

Not quite accurately, then, we can say that the number of Newton iterations required to minimize $f$ is bounded above by

$$\frac{f(x^{(0)}) - p^\star}{\gamma} + 6. \tag{9.37}$$

A more precise statement is that (9.37) is a bound on the number of iterations to compute an extremely good approximation of the solution.


Damped  Newton  phase 

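We now establish the inequality (9.32). Assume $\|\nabla f(x)\|_2 \ge \eta$. We first derive a lower bound on the step size selected by the line search. Strong convexity implies that $\nabla^2 f(x) \preceq MI$ on $S$, and therefore

$$\begin{aligned}
f(x + t\Delta x_{\mathrm{nt}}) &\le f(x) + t\nabla f(x)^T \Delta x_{\mathrm{nt}} + \frac{M\|\Delta x_{\mathrm{nt}}\|_2^2}{2}t^2 \\
&\le f(x) - t\lambda(x)^2 + \frac{M}{2m}t^2\lambda(x)^2,
\end{aligned}$$

where we use (9.30) and

$$\lambda(x)^2 = \Delta x_{\mathrm{nt}}^T \nabla^2 f(x) \Delta x_{\mathrm{nt}} \ge m\|\Delta x_{\mathrm{nt}}\|_2^2.$$

The step size $\hat{t} = m/M$ satisfies the exit condition of the line search, since

$$f(x + \hat{t}\Delta x_{\mathrm{nt}}) \le f(x) - \frac{m}{2M}\lambda(x)^2 \le f(x) - \alpha\hat{t}\lambda(x)^2.$$

Therefore the line search returns a step size $t \ge \beta m/M$, resulting in a decrease of the objective function

$$\begin{aligned}
f(x^+) - f(x) &\le -\alpha t\lambda(x)^2 \\
&\le -\alpha\beta\frac{m}{M}\lambda(x)^2 \\
&\le -\alpha\beta\frac{m}{M^2}\|\nabla f(x)\|_2^2 \\
&\le -\alpha\beta\eta^2\frac{m}{M^2},
\end{aligned}$$

where we use

$$\lambda(x)^2 = \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) \ge (1/M)\|\nabla f(x)\|_2^2.$$

Therefore, (9.32) is satisfied with

$$\gamma = \alpha\beta\eta^2\frac{m}{M^2}. \tag{9.38}$$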


Quadratically  convergent  phase 


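We now establish the inequality (9.33). Assume $\|\nabla f(x)\|_2 < \eta$. We first show that the backtracking line search selects unit steps, provided

$$\eta \le 3(1 - 2\alpha)\frac{m^2}{L}.$$

By the Lipschitz condition (9.31), we have, for $t \ge 0$,

$$\|\nabla^2 f(x + t\Delta x_{\mathrm{nt}}) - \nabla^2 f(x)\|_2 \le tL\|\Delta x_{\mathrm{nt}}\|_2,$$

and therefore

$$\left|\Delta x_{\mathrm{nt}}^T\left(\nabla^2 f(x + t\Delta x_{\mathrm{nt}}) - \nabla^2 f(x)\right)\Delta x_{\mathrm{nt}}\right| \le tL\|\Delta x_{\mathrm{nt}}\|_2^3.$$

With $\tilde{f}(t) = f(x + t\Delta x_{\mathrm{nt}})$, we have $\tilde{f}''(t) = \Delta x_{\mathrm{nt}}^T \nabla^2 f(x + t\Delta x_{\mathrm{nt}})\Delta x_{\mathrm{nt}}$, so the inequality above is

$$|\tilde{f}''(t) - \tilde{f}''(0)| \le tL\|\Delta x_{\mathrm{nt}}\|_2^3.$$

We will use this inequality to determine an upper bound on $\tilde{f}(t)$. We start with

$$\tilde{f}''(t) \le \tilde{f}''(0) + tL\|\Delta x_{\mathrm{nt}}\|_2^3 \le \lambda(x)^2 + t\frac{L}{m^{3/2}}\lambda(x)^3,$$

where we use $\tilde{f}''(0) = \lambda(x)^2$ and $\lambda(x)^2 \ge m\|\Delta x_{\mathrm{nt}}\|_2^2$. We integrate the inequality to get

$$\begin{aligned}
\tilde{f}'(t) &\le \tilde{f}'(0) + t\lambda(x)^2 + t^2\frac{L}{2m^{3/2}}\lambda(x)^3 \\
&= -\lambda(x)^2 + t\lambda(x)^2 + t^2\frac{L}{2m^{3/2}}\lambda(x)^3,
\end{aligned}$$

using $\tilde{f}'(0) = -\lambda(x)^2$. We integrate once more to get

$$\tilde{f}(t) \le \tilde{f}(0) - t\lambda(x)^2 + t^2\frac{1}{2}\lambda(x)^2 + t^3\frac{L}{6m^{3/2}}\lambda(x)^3.$$

Finally, we take $t = 1$ to obtain

$$f(x + \Delta x_{\mathrm{nt}}) \le f(x) - \frac{1}{2}\lambda(x)^2 + \frac{L}{6m^{3/2}}\lambda(x)^3. \tag{9.39}$$

Now suppose $\|\nabla f(x)\|_2 \le \eta \le 3(1 - 2\alpha)m^2/L$. By strong convexity, we have

$$\lambda(x) \le 3(1 - 2\alpha)m^{3/2}/L,$$

and by (9.39) we have

$$\begin{aligned}
f(x + \Delta x_{\mathrm{nt}}) &\le f(x) - \lambda(x)^2\left(\frac{1}{2} - \frac{L\lambda(x)}{6m^{3/2}}\right) \\
&\le f(x) - \alpha\lambda(x)^2 \\
&= f(x) + \alpha\nabla f(x)^T\Delta x_{\mathrm{nt}},
\end{aligned}$$

which shows that the unit step $t = 1$ is accepted by the backtracking line search.

Let us now examine the rate of convergence. Applying the Lipschitz condition, we have

Figure 9.19 Newton’s method for the problem in $\mathbf{R}^2$, with objective $f$ given in (9.20), and backtracking line search parameters $\alpha = 0.1$, $\beta = 0.7$. Also shown are the ellipsoids $\{x \mid \|x - x^{(k)}\|_{\nabla^2 f(x^{(k)})} \le 1\}$ at the first two iterates.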


9.5.4  Examples 

Example  in  R2 

We first apply Newton’s method with backtracking line search on the test function (9.20), with line search parameters $\alpha = 0.1$, $\beta = 0.7$. Figure 9.19 shows the Newton iterates, and also the ellipsoids

$$\{x \mid \|x - x^{(k)}\|_{\nabla^2 f(x^{(k)})} \le 1\}$$

for the first two iterates $k = 0, 1$. The method works well because these ellipsoids give good approximations of the shape of the sublevel sets.

Figure 9.20 shows the error versus iteration number for the same example. This plot shows that convergence to a very high accuracy is achieved in only five iterations. Quadratic convergence is clearly apparent: The last step reduces the error from about $10^{-5}$ to $10^{-10}$.

Example in $\mathbf{R}^{100}$

Figure 9.21 shows the convergence of Newton’s method with backtracking and exact line search for a problem in $\mathbf{R}^{100}$. The objective function has the form (9.21), with the same problem data and the same starting point as was used in figure 9.6. The plot for the backtracking line search shows that a very high accuracy is attained in eight iterations. Like the example in $\mathbf{R}^2$, quadratic convergence is clearly evident after about the third iteration. The number of iterations in Newton’s method with exact line search is only one smaller than with a backtracking line search. This is also typical. An exact line search usually gives a very small improvement in convergence of Newton’s method. Figure 9.22 shows the step sizes for this example. After two damped steps, the steps taken by the backtracking line search are all full, i.e., $t = 1$.

Experiments with the values of the backtracking parameters $\alpha$ and $\beta$ reveal that they have little effect on the performance of Newton’s method, for this example (and others). With $\alpha$ fixed at 0.01, and values of $\beta$ varying between 0.2 and 1, the number of iterations required varies between 8 and 12. With $\beta$ fixed at 0.5, the number of iterations is 8, for all values of $\alpha$ between 0.005 and 0.5. For these reasons, most practical implementations use a backtracking line search with a small value of $\alpha$, such as 0.01, and a larger value of $\beta$, such as 0.5.

Figure 9.20 Error versus iteration $k$ of Newton’s method for the problem in $\mathbf{R}^2$. Convergence to a very high accuracy is achieved in five iterations.

Figure 9.21 Error versus iteration for Newton’s method for the problem in $\mathbf{R}^{100}$. The backtracking line search parameters are $\alpha = 0.01$, $\beta = 0.5$. Here too convergence is extremely rapid: a very high accuracy is attained in only seven or eight iterations. The convergence of Newton’s method with exact line search is only one iteration faster than with backtracking line search.

Figure 9.22 The step size $t$ versus iteration for Newton’s method with backtracking and exact line search, applied to the problem in $\mathbf{R}^{100}$. The backtracking line search takes one backtracking step in the first two iterations. After the first two iterations it always selects $t = 1$.

Example  in  R10000 

In this last example we consider a larger problem, of the form

$$\text{minimize} \quad -\sum_{i=1}^n \log(1 - x_i^2) - \sum_{i=1}^m \log(b_i - a_i^T x)$$

with $m = 100000$ and $n = 10000$. The problem data $a_i$ are randomly generated sparse vectors. Figure 9.23 shows the convergence of Newton’s method with backtracking line search, with parameters $\alpha = 0.01$, $\beta = 0.5$. The performance is very similar to the previous convergence plots. A linearly convergent initial phase of about 13 iterations is followed by a quadratically convergent phase, that achieves a very high accuracy in 4 or 5 more iterations.

Affine  invariance  of  Newton’s  method 

A very important feature of Newton’s method is that it is independent of linear (or affine) changes of coordinates. Let $x^{(k)}$ be the $k$th iterate of Newton’s method, applied to $f : \mathbf{R}^n \to \mathbf{R}$. Suppose $T \in \mathbf{R}^{n \times n}$ is nonsingular, and define $\bar{f}(y) = f(Ty)$. If we use Newton’s method (with the same backtracking parameters) to minimize $\bar{f}$, starting from $y^{(0)} = T^{-1}x^{(0)}$, then we have

$$Ty^{(k)} = x^{(k)}$$

for all $k$. In other words, Newton’s method is the same: The iterates are related by the same change of coordinates. Even the stopping criterion is the same, since the Newton decrement for $\bar{f}$ at $y^{(k)}$ is the same as the Newton decrement for $f$ at $x^{(k)}$. This is in stark contrast to the gradient (or steepest descent) method, which is strongly affected by changes of coordinates.

Figure 9.23 Error versus iteration of Newton’s method, for a problem in $\mathbf{R}^{10000}$. A backtracking line search with parameters $\alpha = 0.01$, $\beta = 0.5$ is used. Even for this large scale problem, Newton’s method requires only 18 iterations to achieve very high accuracy.

As an example, consider the family of problems given in (9.22), indexed by the parameter $\gamma$, which affects the condition number of the sublevel sets. We observed (in figures 9.7 and 9.8) that the gradient method slows to useless for values of $\gamma$ smaller than 0.05 or larger than 20. In contrast, Newton’s method (with $\alpha = 0.01$, $\beta = 0.5$) solves this problem (in fact, to a far higher accuracy) in nine iterations, for all values of $\gamma$ between $10^{-10}$ and $10^{10}$.

In a real implementation, with finite precision arithmetic, Newton’s method is not exactly independent of affine changes of coordinates, or the condition number of the sublevel sets. But we can say that condition numbers ranging up to very large values such as $10^{10}$ do not adversely affect a real implementation of Newton’s method. For the gradient method, a far smaller range of condition numbers can be tolerated. While choice of coordinates (or condition number of sublevel sets) is a first-order issue for gradient and steepest descent methods, it is a second-order issue for Newton’s method; its only effect is in the numerical linear algebra required to compute the Newton step.

Summary

Newton’s  method  has  several  very  strong  advantages  over  gradient  and  steepest 
descent  methods: 

• Convergence of Newton’s method is rapid in general, and quadratic near $x^\star$.
Once  the  quadratic  convergence  phase  is  reached,  at  most  six  or  so  iterations 
are  required  to  produce  a  solution  of  very  high  accuracy. 

•  Newton’s  method  is  affine  invariant.  It  is  insensitive  to  the  choice  of  coordi¬ 
nates,  or  the  condition  number  of  the  sublevel  sets  of  the  objective. 

• Newton’s method scales well with problem size. Its performance on problems in $\mathbf{R}^{10000}$ is similar to its performance on problems in $\mathbf{R}^{10}$, with only a modest increase in the number of steps required.

•  The  good  performance  of  Newton’s  method  is  not  dependent  on  the  choice 
of  algorithm  parameters.  In  contrast,  the  choice  of  norm  for  steepest  descent 
plays  a  critical  role  in  its  performance. 

The  main  disadvantage  of  Newton’s  method  is  the  cost  of  forming  and  storing 
the  Hessian,  and  the  cost  of  computing  the  Newton  step,  which  requires  solving 
a  set  of  linear  equations.  We  will  see  in  §9.7  that  in  many  cases  it  is  possible  to 
exploit  problem  structure  to  substantially  reduce  the  cost  of  computing  the  Newton 
step. 

Another  alternative  is  provided  by  a  family  of  algorithms  for  unconstrained  op¬ 
timization  called  quasi-Newton  methods.  These  methods  require  less  computational 
effort  to  form  the  search  direction,  but  they  share  some  of  the  strong  advantages 
of  Newton  methods,  such  as  rapid  convergence  near  x* .  Since  quasi-Newton  meth¬ 
ods  are  described  in  many  books,  and  tangential  to  our  main  theme,  we  will  not 
consider  them  in  this  book. 


9.6  Self-concordance 

There  are  two  major  shortcomings  of  the  classical  convergence  analysis  of  Newton’s 
method  given  in  §9.5.3.  The  first  is  a  practical  one:  The  resulting  complexity 
estimates  involve  the  three  constants  m,  M,  and  L ,  which  are  almost  never  known 
in  practice.  As  a  result,  the  bound  (9.40)  on  the  number  of  Newton  steps  required 
is  almost  never  known  specifically,  since  it  depends  on  three  constants  that  are,  in 
general,  not  known.  Of  course  the  convergence  analysis  and  complexity  estimate 
are  still  conceptually  useful. 

The second shortcoming is that while Newton’s method is affinely invariant, the classical analysis of Newton’s method is very much dependent on the coordinate system used. If we change coordinates the constants $m$, $M$, and $L$ all change. If for no reason other than aesthetic, we should seek an analysis of Newton’s method that is, like the method itself, independent of affine changes of coordinates. In other words, we seek an alternative to the assumptions

$$mI \preceq \nabla^2 f(x) \preceq MI, \qquad \|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L\|x - y\|_2,$$

that is independent of affine changes of coordinates, and also allows us to analyze Newton’s method.

A  simple  and  elegant  assumption  that  achieves  this  goal  was  discovered  by 
Nesterov  and  Nemirovski,  who  gave  the  name  self-concordance  to  their  condition. 
Self-concordant  functions  are  important  for  several  reasons. 

•  They  include  many  of  the  logarithmic  barrier  functions  that  play  an  impor¬ 
tant  role  in  interior-point  methods  for  solving  convex  optimization  problems. 

•  The  analysis  of  Newton’s  method  for  self-concordant  functions  does  not  de¬ 
pend  on  any  unknown  constants. 

• Self-concordance is an affine-invariant property, i.e., if we apply a linear transformation of variables to a self-concordant function, we obtain a self-concordant function. Therefore the complexity estimate that we obtain for Newton’s method applied to a self-concordant function is independent of affine changes of coordinates.


9.6.1  Definition  and  examples 

Self-concordant  functions  on  R 

We start by considering functions on $\mathbf{R}$. A convex function $f : \mathbf{R} \to \mathbf{R}$ is self-concordant if

$$|f'''(x)| \le 2f''(x)^{3/2} \tag{9.41}$$

for all $x \in \operatorname{dom} f$. Since linear and (convex) quadratic functions have zero third derivative, they are evidently self-concordant. Some more interesting examples are given below.


Example 9.3 Logarithm and entropy.

• Negative logarithm. The function $f(x) = -\log x$ is self-concordant. Using $f''(x) = 1/x^2$, $f'''(x) = -2/x^3$, we find that

$$\frac{|f'''(x)|}{2f''(x)^{3/2}} = \frac{2/x^3}{2(1/x^2)^{3/2}} = 1,$$

so the defining inequality (9.41) holds with equality.

• Negative entropy plus negative logarithm. The function $f(x) = x\log x - \log x$ is self-concordant. To verify this, we use

$$f''(x) = \frac{x + 1}{x^2}, \qquad f'''(x) = -\frac{x + 2}{x^3}$$

to obtain

$$\frac{|f'''(x)|}{2f''(x)^{3/2}} = \frac{x + 2}{2(x + 1)^{3/2}}.$$

The function on the righthand side is maximized on $\mathbf{R}_+$ by $x = 0$, where its value is 1.

The negative entropy function by itself is not self-concordant; see exercise 11.13.
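The ratios computed in this example are easy to check numerically. The sketch below evaluates $|f'''(x)|/(2f''(x)^{3/2})$ on a grid for the two functions above and, for comparison, for the negative entropy $x\log x$ alone, whose derivatives $f''(x) = 1/x$ and $f'''(x) = -1/x^2$ are standard. The helper name `sc_ratio` and the grid are illustrative choices.

```python
import numpy as np

def sc_ratio(d2, d3, x):
    """Return |f'''(x)| / (2 f''(x)^{3/2}); self-concordance requires this to be <= 1."""
    return np.abs(d3(x)) / (2.0 * d2(x) ** 1.5)

x = np.linspace(0.01, 50.0, 2000)            # grid inside the domain x > 0

# f(x) = -log x
r_log = sc_ratio(lambda x: 1 / x**2, lambda x: -2 / x**3, x)

# f(x) = x log x - log x
r_mix = sc_ratio(lambda x: (x + 1) / x**2, lambda x: -(x + 2) / x**3, x)

# f(x) = x log x (negative entropy alone)
r_ent = sc_ratio(lambda x: 1 / x, lambda x: -1 / x**2, x)

print("-log x:            max ratio", r_log.max())
print("x log x - log x:   max ratio", r_mix.max())
print("x log x:           max ratio", r_ent.max())
```

For the negative entropy the ratio equals $1/(2\sqrt{x})$, which exceeds 1 for $x < 1/4$, in agreement with the remark that it is not self-concordant by itself.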


We should make two important remarks about the self-concordance definition (9.41). The first concerns the mysterious constant 2 that appears in the definition. In fact, this constant is chosen for convenience, in order to simplify the formulas later on; any other positive constant could be used instead. Suppose, for example, that the convex function $f : \mathbf{R} \to \mathbf{R}$ satisfies

$$|f'''(x)| \le kf''(x)^{3/2} \tag{9.42}$$

where $k$ is some positive constant. Then the function $\tilde{f}(x) = (k^2/4)f(x)$ satisfies

$$\begin{aligned}
|\tilde{f}'''(x)| &= (k^2/4)|f'''(x)| \\
&\le (k^3/4)f''(x)^{3/2} \\
&= (k^3/4)\left((4/k^2)\tilde{f}''(x)\right)^{3/2} \\
&= 2\tilde{f}''(x)^{3/2}
\end{aligned}$$

and therefore is self-concordant. This shows that a function that satisfies (9.42) for some positive $k$ can be scaled to satisfy the standard self-concordance inequality (9.41). So what is important is that the third derivative of the function is bounded by some multiple of the 3/2-power of its second derivative. By appropriately scaling the function, we can change the multiple to the constant 2.

The second comment is a simple calculation that shows why self-concordance
is so important: it is affine invariant. Suppose we define the function $\tilde f$ by
$\tilde f(y) = f(ay + b)$, where $a \neq 0$. Then $f$ is self-concordant if and only if
$\tilde f$ is. To see this, we substitute

$$\tilde f''(y) = a^2 f''(x), \qquad \tilde f'''(y) = a^3 f'''(x),$$

where $x = ay + b$, into the self-concordance inequality for $\tilde f$, i.e.,
$|\tilde f'''(y)| \leq 2 \tilde f''(y)^{3/2}$, to obtain

$$|a^3 f'''(x)| \leq 2\,(a^2 f''(x))^{3/2},$$

which (after dividing by $|a|^3$) is the self-concordance inequality for $f$. Roughly
speaking, the self-concordance condition (9.41) is a way to limit the third derivative
of a function, in a way that is independent of affine coordinate changes.
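
The two remarks above can be checked numerically on a simple case. The following is a minimal sketch, not part of the original text: it uses $f(x) = -\log x$ as the test function, and the helper names (`sc_ratio`, `ratio_affine`, `ratio_scaled`) are hypothetical. It evaluates the ratio $|f'''|/(2 f''^{3/2})$, which self-concordance requires to be at most one.

```python
def sc_ratio(d2, d3):
    """Return |f'''| / (2 f''^(3/2)); self-concordance means this is <= 1."""
    return abs(d3) / (2.0 * d2 ** 1.5)


def neg_log_derivs(x):
    """Second and third derivatives of f(x) = -log x at x > 0."""
    return 1.0 / x**2, -2.0 / x**3


def ratio_neg_log(x):
    return sc_ratio(*neg_log_derivs(x))


def ratio_affine(y, a, b):
    """Ratio for ftilde(y) = f(a*y + b): ftilde'' = a^2 f'', ftilde''' = a^3 f'''."""
    d2, d3 = neg_log_derivs(a * y + b)
    return sc_ratio(a**2 * d2, a**3 * d3)


def ratio_scaled(x, c):
    """Ratio for c*f with c > 0; scaling multiplies the ratio by c^(-1/2)."""
    d2, d3 = neg_log_derivs(x)
    return sc_ratio(c * d2, c * d3)


if __name__ == "__main__":
    for x in (0.1, 1.0, 7.5):
        print(x, ratio_neg_log(x), ratio_affine(x, -3.0, 40.0), ratio_scaled(x, 4.0))
```

For $-\log x$ the ratio equals one at every point, it is unchanged by the affine change of variable, and scaling the function by 4 halves it.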

Self-concordant functions on $\mathbf{R}^n$

We now consider functions on $\mathbf{R}^n$ with $n > 1$. We say a function
$f : \mathbf{R}^n \to \mathbf{R}$ is self-concordant if it is self-concordant along every
line in its domain, i.e., if the function $\tilde f(t) = f(x + tv)$ is a self-concordant
function of $t$ for all $x \in \operatorname{dom} f$ and for all $v$.


9.6  Self-concordance 




9.6.2  Self-concordant  calculus 

Scaling  and  sum 

Self-concordance is preserved by scaling by a factor exceeding one: If $f$ is
self-concordant and $a \geq 1$, then $af$ is self-concordant. Self-concordance is also
preserved by addition: If $f_1$, $f_2$ are self-concordant, then $f_1 + f_2$ is
self-concordant. To show this, it is sufficient to consider functions
$f_1, f_2 : \mathbf{R} \to \mathbf{R}$. We have

$$\begin{aligned}
|f_1'''(x) + f_2'''(x)| &\leq |f_1'''(x)| + |f_2'''(x)| \\
&\leq 2\left(f_1''(x)^{3/2} + f_2''(x)^{3/2}\right) \\
&\leq 2\left(f_1''(x) + f_2''(x)\right)^{3/2}.
\end{aligned}$$

In the last step we use the inequality

$$(u^{3/2} + v^{3/2})^{2/3} \leq u + v,$$

which holds for $u, v \geq 0$.

Composition  with  affine  function 

If $f : \mathbf{R}^n \to \mathbf{R}$ is self-concordant, and $A \in \mathbf{R}^{n \times m}$,
$b \in \mathbf{R}^n$, then $f(Ax + b)$ is self-concordant.


Example 9.4 Log barrier for linear inequalities. The function

$$f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x),$$

with $\operatorname{dom} f = \{x \mid a_i^T x < b_i,\ i = 1, \ldots, m\}$, is self-concordant.
Each term $-\log(b_i - a_i^T x)$ is the composition of $-\log y$ with the affine
transformation $y = b_i - a_i^T x$, and hence self-concordant. Therefore the sum is
also self-concordant.
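
As an illustration of example 9.4, the sketch below evaluates the log barrier and checks the self-concordance inequality along a line. It is our own minimal example under stated assumptions: the function names are hypothetical, the data are passed as plain Python lists, and the derivative formulas follow from differentiating each term $-\log(s_i - t\,a_i^T v)$ with respect to $t$.

```python
import math


def barrier(a_rows, b, x):
    """f(x) = -sum_i log(b_i - a_i^T x); returns +inf outside dom f."""
    total = 0.0
    for a_i, b_i in zip(a_rows, b):
        slack = b_i - sum(aij * xj for aij, xj in zip(a_i, x))
        if slack <= 0.0:
            return math.inf
        total -= math.log(slack)
    return total


def line_derivatives(a_rows, b, x, v, t):
    """Second and third derivatives of t -> f(x + t v) for the log barrier.

    Assumes x + t v is strictly feasible.
    """
    d2 = d3 = 0.0
    for a_i, b_i in zip(a_rows, b):
        slack = b_i - sum(aij * (xj + t * vj) for aij, xj, vj in zip(a_i, x, v))
        r = sum(aij * vj for aij, vj in zip(a_i, v)) / slack  # a_i^T v / slack
        d2 += r**2
        d3 += 2.0 * r**3
    return d2, d3


if __name__ == "__main__":
    # Unit box 0 < x_1, x_2 < 1 written as four linear inequalities.
    A = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    b = [1.0, 0.0, 1.0, 0.0]
    d2, d3 = line_derivatives(A, b, [0.3, 0.6], [1.0, -2.0], 0.1)
    print(abs(d3) <= 2.0 * d2**1.5)
```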


Example 9.5 Log-determinant. The function $f(X) = -\log\det X$ is self-concordant
on $\operatorname{dom} f = \mathbf{S}^n_{++}$. To show this, we consider the function
$\tilde f(t) = f(X + tV)$, where $X \succ 0$ and $V \in \mathbf{S}^n$. It can be expressed as

$$\begin{aligned}
\tilde f(t) &= -\log\det\left(X^{1/2}(I + tX^{-1/2}VX^{-1/2})X^{1/2}\right) \\
&= -\log\det X - \log\det(I + tX^{-1/2}VX^{-1/2}) \\
&= -\log\det X - \sum_{i=1}^n \log(1 + t\lambda_i)
\end{aligned}$$

where $\lambda_i$ are the eigenvalues of $X^{-1/2}VX^{-1/2}$. Each term $-\log(1 + t\lambda_i)$
is a self-concordant function of $t$, so the sum, $\tilde f$, is self-concordant. It follows
that $f$ is self-concordant.
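
The identity used in example 9.5 can be verified numerically in the $2 \times 2$ case. The sketch below is ours and rests on one stated assumption: instead of forming $X^{-1/2}$, it computes the $\lambda_i$ as the roots of $\det(V - \lambda X) = 0$, which coincide with the eigenvalues of $X^{-1/2}VX^{-1/2}$ when $X \succ 0$. All function names are hypothetical.

```python
import math


def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def gen_eigs_2x2(X, V):
    """Roots of det(V - lam*X) = 0 for symmetric 2x2 X (positive definite) and V."""
    a = det2(X)
    b = -(X[0][0] * V[1][1] + X[1][1] * V[0][0] - 2.0 * X[0][1] * V[0][1])
    c = det2(V)
    disc = math.sqrt(max(b * b - 4.0 * a * c, 0.0))
    return (-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)


def f_direct(X, V, t):
    """-log det(X + t V), evaluated from scratch."""
    M = [[X[i][j] + t * V[i][j] for j in range(2)] for i in range(2)]
    return -math.log(det2(M))


def f_via_eigs(X, V, t):
    """-log det X - sum_i log(1 + t*lam_i)."""
    lams = gen_eigs_2x2(X, V)
    return -math.log(det2(X)) - sum(math.log(1.0 + t * lam) for lam in lams)


if __name__ == "__main__":
    X = [[2.0, 0.5], [0.5, 1.0]]
    V = [[1.0, -1.0], [-1.0, 0.5]]
    print(f_direct(X, V, 0.3), f_via_eigs(X, V, 0.3))
```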


Example 9.6 Log of concave quadratic. The function

$$f(x) = -\log(x^T P x + q^T x + r),$$

where $P \in -\mathbf{S}^n_+$, is self-concordant on

$$\operatorname{dom} f = \{x \mid x^T P x + q^T x + r > 0\}.$$

To show this, it suffices to consider the case $n = 1$ (since by restricting $f$ to a line,
the general case reduces to the $n = 1$ case). We can then express $f$ as

$$f(x) = -\log(px^2 + qx + r) = -\log\left(-p(x - a)(b - x)\right)$$

where $\operatorname{dom} f = (a, b)$ (i.e., $a$ and $b$ are the roots of $px^2 + qx + r$).
Using this expression we have

$$f(x) = -\log(-p) - \log(x - a) - \log(b - x),$$

which establishes self-concordance.


Composition  with  logarithm 

Let $g : \mathbf{R} \to \mathbf{R}$ be a convex function with
$\operatorname{dom} g = \mathbf{R}_{++}$, and

$$|g'''(x)| \leq 3\,\frac{g''(x)}{x} \tag{9.43}$$

for all $x$. Then

$$f(x) = -\log(-g(x)) - \log x$$

is self-concordant on $\{x \mid x > 0,\ g(x) < 0\}$. (For a proof, see exercise 9.14.)

The condition (9.43) is homogeneous and preserved under addition. It is satisfied
by all (convex) quadratic functions, i.e., functions of the form $ax^2 + bx + c$,
where $a \geq 0$. Therefore if (9.43) holds for a function $g$, then it holds for the function
$g(x) + ax^2 + bx + c$, where $a \geq 0$.


Example 9.7 The following functions $g$ satisfy the condition (9.43).

• $g(x) = -x^p$ for $0 < p \leq 1$.

• $g(x) = -\log x$.

• $g(x) = x \log x$.

• $g(x) = x^p$ for $-1 \leq p \leq 0$.

• $g(x) = (ax + b)^2/x$.

It follows that in each case, the function $f(x) = -\log(-g(x)) - \log x$ is self-concordant.
More generally, the function $f(x) = -\log(-g(x) - ax^2 - bx - c) - \log x$ is
self-concordant on its domain,

$$\{x \mid x > 0,\ g(x) + ax^2 + bx + c < 0\},$$

provided $a \geq 0$.
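
The condition $|g'''(x)| \leq 3 g''(x)/x$ is easy to test on sample points. The sketch below is our own check, with hypothetical names and arbitrarily chosen parameter values ($p = 0.5$, $p = -1$, $a = 1$, $b = 2$); the derivatives are entered in closed form. A small relative tolerance is used because the bound holds with equality for some of the examples.

```python
def satisfies_943(g2, g3, xs, rel_tol=1e-12):
    """Check |g'''(x)| <= 3 g''(x)/x at the sample points xs (all > 0)."""
    return all(abs(g3(x)) <= 3.0 * g2(x) / x * (1.0 + rel_tol) for x in xs)


def power_derivs(c, p):
    """Return (g'', g''') for g(x) = c * x**p."""
    return (lambda x: c * p * (p - 1) * x ** (p - 2),
            lambda x: c * p * (p - 1) * (p - 2) * x ** (p - 3))


# (g'', g''') pairs for the functions listed in example 9.7.
EXAMPLES = {
    "-x^p, p = 0.5": power_derivs(-1.0, 0.5),
    "-log x": (lambda x: 1.0 / x**2, lambda x: -2.0 / x**3),
    "x log x": (lambda x: 1.0 / x, lambda x: -1.0 / x**2),
    "x^p, p = -1": power_derivs(1.0, -1.0),
    "(ax+b)^2/x, a = 1, b = 2": (lambda x: 8.0 / x**3, lambda x: -24.0 / x**4),
}

if __name__ == "__main__":
    xs = [0.01 * k for k in range(1, 1001)]
    for name, (g2, g3) in EXAMPLES.items():
        print(name, satisfies_943(g2, g3, xs))
    # For comparison, x^p with p = -2 lies outside the stated range and fails.
    print("x^p, p = -2", satisfies_943(*power_derivs(1.0, -2.0), xs))
```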


Example 9.8 The composition with logarithm rule allows us to show self-concordance
of the following functions.

• $f(x, y) = -\log(y^2 - x^T x)$ on $\{(x, y) \mid \|x\|_2 < y\}$.

• $f(x, y) = -2\log y - \log(y^{2/p} - x^2)$, with $p \geq 1$, on $\{(x, y) \in \mathbf{R}^2 \mid |x|^p < y\}$.

• $f(x, y) = -\log y - \log(\log y - x)$ on $\{(x, y) \mid e^x < y\}$.

We leave the details as an exercise (exercise 9.15).

9.6.3 Properties of self-concordant functions


In §9.1.2 we used strong convexity to derive bounds on the suboptimality of a point
$x$ in terms of the norm of the gradient at $x$. For strictly convex self-concordant
functions, we can obtain similar bounds in terms of the Newton decrement

$$\lambda(x) = \left(\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\right)^{1/2}.$$

(It can be shown that the Hessian of a strictly convex self-concordant function is
positive definite everywhere; see exercise 9.17.) Unlike the bounds based on the
norm of the gradient, the bounds based on the Newton decrement are not affected
by an affine change of coordinates.

For future reference we note that the Newton decrement can also be expressed
as

$$\lambda(x) = \sup_{v \neq 0} \frac{-v^T \nabla f(x)}{\left(v^T \nabla^2 f(x) v\right)^{1/2}}$$

(see exercise 9.9). In other words, we have

$$\frac{-v^T \nabla f(x)}{\left(v^T \nabla^2 f(x) v\right)^{1/2}} \leq \lambda(x) \tag{9.44}$$

for any nonzero $v$, with equality for $v = \Delta x_{\mathrm{nt}}$.
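
The inequality (9.44) can be illustrated on a small example. The sketch below is ours: it takes an arbitrary gradient $g$ and a $2 \times 2$ positive definite Hessian $H$ (both assumptions chosen for illustration), computes $\lambda$ and the Newton step from the explicit inverse, and compares the ratio in (9.44) over a sweep of directions. The function names are hypothetical.

```python
import math


def newton_decrement_2d(g, H):
    """Return (lambda, dx_nt) for a 2x2 positive definite H, where H dx_nt = -g."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    dx = [-(H[1][1] * g[0] - H[0][1] * g[1]) / det,
          -(-H[1][0] * g[0] + H[0][0] * g[1]) / det]
    lam = math.sqrt(-(g[0] * dx[0] + g[1] * dx[1]))  # lambda^2 = -g^T dx_nt
    return lam, dx


def directional_ratio(g, H, v):
    """-v^T g / (v^T H v)^(1/2), the lefthand side of (9.44)."""
    num = -(v[0] * g[0] + v[1] * g[1])
    quad = sum(v[i] * H[i][j] * v[j] for i in range(2) for j in range(2))
    return num / math.sqrt(quad)


if __name__ == "__main__":
    g, H = [1.0, -2.0], [[3.0, 1.0], [1.0, 2.0]]
    lam, dx = newton_decrement_2d(g, H)
    best = max(directional_ratio(g, H, [math.cos(0.01 * k), math.sin(0.01 * k)])
               for k in range(629))
    print(lam, directional_ratio(g, H, dx), best)
```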


Upper  and  lower  bounds  on  second  derivatives 

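Suppose $f : \mathbf{R} \to \mathbf{R}$ is a strictly convex self-concordant function.
We can write the self-concordance inequality (9.41) as

$$\left| \frac{d}{dt}\left(f''(t)^{-1/2}\right) \right| \leq 1 \tag{9.45}$$

for all $t \in \operatorname{dom} f$ (see exercise 9.16). Assuming $t \geq 0$ and the interval
between 0 and $t$ is in $\operatorname{dom} f$, we can integrate (9.45) between 0 and $t$ to obtain

$$-t \leq \int_0^t \frac{d}{d\tau}\left(f''(\tau)^{-1/2}\right) d\tau \leq t,$$

i.e., $-t \leq f''(t)^{-1/2} - f''(0)^{-1/2} \leq t$. From this we obtain lower and upper bounds
on $f''(t)$:

$$\frac{f''(0)}{\left(1 + t f''(0)^{1/2}\right)^2} \leq f''(t) \leq \frac{f''(0)}{\left(1 - t f''(0)^{1/2}\right)^2}. \tag{9.46}$$

The lower bound is valid for all nonnegative $t \in \operatorname{dom} f$; the upper bound is valid
if $t \in \operatorname{dom} f$ and $0 \leq t < f''(0)^{-1/2}$.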
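
A quick check of (9.46) on a concrete function may help. The sketch below is ours and assumes the test function $f(t) = -\log(1 - t) - \log(1 + t)$ on $(-1, 1)$, which is strictly convex and self-concordant as a sum of two log terms; the function names are hypothetical.

```python
import math


def second_derivative(t):
    """f''(t) for f(t) = -log(1 - t) - log(1 + t), with dom f = (-1, 1)."""
    return 1.0 / (1.0 - t) ** 2 + 1.0 / (1.0 + t) ** 2


def bounds_946(h0, t):
    """Lower and upper bounds (9.46) on f''(t), given h0 = f''(0) and t >= 0.

    The upper bound is reported as +inf when t >= h0^(-1/2), where it is not valid.
    """
    r = math.sqrt(h0)
    lower = h0 / (1.0 + t * r) ** 2
    upper = h0 / (1.0 - t * r) ** 2 if t * r < 1.0 else math.inf
    return lower, upper


if __name__ == "__main__":
    h0 = second_derivative(0.0)
    for t in (0.0, 0.1, 0.5, 0.7, 0.9):
        lo, hi = bounds_946(h0, t)
        print(t, lo, second_derivative(t), hi)
```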


Bound  on  suboptimality 

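Let $f : \mathbf{R}^n \to \mathbf{R}$ be a strictly convex self-concordant function, and let
$v$ be a descent direction (i.e., any direction satisfying $v^T \nabla f(x) < 0$, not
necessarily the Newton direction). Define $\tilde f : \mathbf{R} \to \mathbf{R}$ as
$\tilde f(t) = f(x + tv)$. By definition, the function $\tilde f$ is self-concordant.

Integrating the lower bound in (9.46) yields a lower bound on $\tilde f'(t)$:

$$\tilde f'(t) \geq \tilde f'(0) + \tilde f''(0)^{1/2} - \frac{\tilde f''(0)^{1/2}}{1 + t \tilde f''(0)^{1/2}}. \tag{9.47}$$

Integrating again yields a lower bound on $\tilde f(t)$:

$$\tilde f(t) \geq \tilde f(0) + t \tilde f'(0) + t \tilde f''(0)^{1/2} - \log\left(1 + t \tilde f''(0)^{1/2}\right). \tag{9.48}$$

The righthand side reaches its minimum at

$$\bar t = \frac{-\tilde f'(0)}{\tilde f''(0) + \tilde f''(0)^{1/2} \tilde f'(0)},$$

and evaluating at $\bar t$ provides a lower bound on $\tilde f$:

$$\begin{aligned}
\inf_{t \geq 0} \tilde f(t) &\geq \tilde f(0) + \bar t \tilde f'(0) + \bar t \tilde f''(0)^{1/2} - \log\left(1 + \bar t \tilde f''(0)^{1/2}\right) \\
&= \tilde f(0) - \tilde f'(0) \tilde f''(0)^{-1/2} + \log\left(1 + \tilde f'(0) \tilde f''(0)^{-1/2}\right).
\end{aligned}$$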

The inequality (9.44) can be expressed as

$$\lambda(x) \geq -\tilde f'(0) \tilde f''(0)^{-1/2}$$

(with equality when $v = \Delta x_{\mathrm{nt}}$), since we have

$$\tilde f'(0) = v^T \nabla f(x), \qquad \tilde f''(0) = v^T \nabla^2 f(x) v.$$

Now using the fact that $u + \log(1 - u)$ is a monotonically decreasing function of $u$,
and the inequality above, we get

$$\inf_{t \geq 0} \tilde f(t) \geq \tilde f(0) + \lambda(x) + \log(1 - \lambda(x)).$$

This inequality holds for any descent direction $v$. Therefore

$$p^\star \geq f(x) + \lambda(x) + \log(1 - \lambda(x)) \tag{9.49}$$

provided $\lambda(x) < 1$. The function $-(\lambda + \log(1 - \lambda))$ is plotted in figure 9.24. It
satisfies

$$-(\lambda + \log(1 - \lambda)) \approx \lambda^2/2,$$

for small $\lambda$, and the bound

$$-(\lambda + \log(1 - \lambda)) \leq \lambda^2$$

for $\lambda \leq 0.68$. Thus, we have the bound on suboptimality

$$p^\star \geq f(x) - \lambda(x)^2, \tag{9.50}$$

valid for $\lambda(x) \leq 0.68$.

Recall that $\lambda(x)^2/2$ is the estimate of $f(x) - p^\star$, based on the quadratic model
at $x$; the inequality (9.50) shows that for self-concordant functions, doubling this
estimate gives us a provable bound. In particular, it shows that for self-concordant
functions, we can use the stopping criterion

$$\lambda(x)^2 \leq \epsilon,$$

(where $\epsilon \leq 0.68^2$), and guarantee that on exit $f(x) - p^\star \leq \epsilon$.

Figure 9.24 The solid line is the function $-(\lambda + \log(1 - \lambda))$, which for small $\lambda$
is approximately $\lambda^2/2$. The dashed line shows $\lambda^2$, which is an upper bound
in the interval $0 \leq \lambda \leq 0.68$.
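
The numerical claims about $-(\lambda + \log(1 - \lambda))$ and the resulting stopping criterion can be checked directly. This is a small sketch of our own; the names `gap_bound` and `can_stop` are hypothetical.

```python
import math


def gap_bound(lam):
    """-(lam + log(1 - lam)): the bound on f(x) - p* implied by (9.49), 0 <= lam < 1."""
    return -(lam + math.log1p(-lam))


def can_stop(lam, eps):
    """Stopping criterion lam^2 <= eps; certifies f(x) - p* <= eps when eps <= 0.68^2."""
    if eps > 0.68**2:
        raise ValueError("the guarantee requires eps <= 0.68^2")
    return lam * lam <= eps


if __name__ == "__main__":
    for lam in (0.01, 0.3, 0.68, 0.69):
        print(lam, gap_bound(lam), lam**2 / 2, lam**2)
```

The printed values show the bound close to $\lambda^2/2$ for small $\lambda$, below $\lambda^2$ at $\lambda = 0.68$, and above $\lambda^2$ at $\lambda = 0.69$.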


9.6.4  Analysis  of  Newton’s  method  for  self-concordant  functions 

We now analyze Newton’s method with backtracking line search, when applied to
a strictly convex self-concordant function $f$. As before, we assume that a starting
point $x^{(0)}$ is known, and that the sublevel set $S = \{x \mid f(x) \leq f(x^{(0)})\}$ is closed.
We also assume that $f$ is bounded below. (This implies that $f$ has a minimizer $x^\star$;
see exercise 9.19.)

The analysis is very similar to the classical analysis given in §9.5.2, except that
we use self-concordance as the basic assumption instead of strong convexity and
the Lipschitz condition on the Hessian, and the Newton decrement will play the
role of the norm of the gradient. We will show that there are numbers $\eta$ and $\gamma > 0$,
with $0 < \eta \leq 1/4$, that depend only on the line search parameters $\alpha$ and $\beta$, such
that the following hold:

• If $\lambda(x^{(k)}) > \eta$, then

$$f(x^{(k+1)}) - f(x^{(k)}) \leq -\gamma. \tag{9.51}$$

• If $\lambda(x^{(k)}) \leq \eta$, then the backtracking line search selects $t = 1$ and

$$2\lambda(x^{(k+1)}) \leq \left(2\lambda(x^{(k)})\right)^2. \tag{9.52}$$

These are the analogs of (9.32) and (9.33). As in §9.5.3, the second condition can
be applied recursively, so we can conclude that for all $l \geq k$, we have $\lambda(x^{(l)}) \leq \eta$,
and

$$2\lambda(x^{(l)}) \leq \left(2\lambda(x^{(k)})\right)^{2^{l-k}} \leq (2\eta)^{2^{l-k}} \leq \left(\frac{1}{2}\right)^{2^{l-k}}.$$

As a consequence, for all $l \geq k$,

$$f(x^{(l)}) - p^\star \leq \lambda(x^{(l)})^2 \leq \frac{1}{4}\left(\frac{1}{2}\right)^{2^{l-k+1}} \leq \left(\frac{1}{2}\right)^{2^{l-k+1}},$$

and hence $f(x^{(l)}) - p^\star \leq \epsilon$ if $l - k \geq \log_2 \log_2(1/\epsilon)$.

The first inequality implies that the damped phase cannot require more than
$(f(x^{(0)}) - p^\star)/\gamma$ steps. Thus the total number of iterations required to obtain an
accuracy $f(x) - p^\star \leq \epsilon$, starting at a point $x^{(0)}$, is bounded by

$$\frac{f(x^{(0)}) - p^\star}{\gamma} + \log_2 \log_2(1/\epsilon). \tag{9.53}$$

This is the analog of the bound (9.36) in the classical analysis of Newton’s method.


Damped  Newton  phase 

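Let $\tilde f(t) = f(x + t\Delta x_{\mathrm{nt}})$, so we have

$$\tilde f'(0) = -\lambda(x)^2, \qquad \tilde f''(0) = \lambda(x)^2.$$

If we integrate the upper bound in (9.46) twice, we obtain an upper bound for $\tilde f(t)$:

$$\begin{aligned}
\tilde f(t) &\leq \tilde f(0) + t \tilde f'(0) - t \tilde f''(0)^{1/2} - \log\left(1 - t \tilde f''(0)^{1/2}\right) \\
&= \tilde f(0) - t\lambda(x)^2 - t\lambda(x) - \log(1 - t\lambda(x)),
\end{aligned} \tag{9.54}$$

valid for $0 \leq t < 1/\lambda(x)$.

We can use this bound to show the backtracking line search always results in a
step size $t \geq \beta/(1 + \lambda(x))$. To prove this we note that the point
$\hat t = 1/(1 + \lambda(x))$ satisfies the exit condition of the line search:

$$\begin{aligned}
\tilde f(\hat t) &\leq \tilde f(0) - \hat t \lambda(x)^2 - \hat t \lambda(x) - \log(1 - \hat t \lambda(x)) \\
&= \tilde f(0) - \lambda(x) + \log(1 + \lambda(x)) \\
&\leq \tilde f(0) - \alpha \frac{\lambda(x)^2}{1 + \lambda(x)} \\
&= \tilde f(0) - \alpha \lambda(x)^2 \hat t.
\end{aligned}$$

The second inequality follows from the fact that

$$-x + \log(1 + x) + \frac{x^2}{2(1 + x)} \leq 0$$

for $x \geq 0$. Since $t \geq \beta/(1 + \lambda(x))$, we have

$$\tilde f(t) - \tilde f(0) \leq -\alpha\beta \frac{\lambda(x)^2}{1 + \lambda(x)},$$

so (9.51) holds with

$$\gamma = \alpha\beta \frac{\eta^2}{1 + \eta}.$$

Quadratically convergent phase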

We will show that we can take

$$\eta = (1 - 2\alpha)/4,$$

(which satisfies $0 < \eta < 1/4$, since $0 < \alpha < 1/2$), i.e., if $\lambda(x^{(k)}) \leq (1 - 2\alpha)/4$, then
the backtracking line search accepts the unit step and (9.52) holds.

We first note that the upper bound (9.54) implies that a unit step $t = 1$ yields a
point in $\operatorname{dom} f$ if $\lambda(x) < 1$. Moreover, if $\lambda(x) \leq (1 - 2\alpha)/2$, we have, using (9.54),

$$\begin{aligned}
\tilde f(1) &\leq \tilde f(0) - \lambda(x)^2 - \lambda(x) - \log(1 - \lambda(x)) \\
&\leq \tilde f(0) - \tfrac{1}{2}\lambda(x)^2 + \lambda(x)^3 \\
&\leq \tilde f(0) - \alpha\lambda(x)^2,
\end{aligned}$$

so the unit step satisfies the condition of sufficient decrease. (The second line
follows from the fact that $-x - \log(1 - x) \leq \tfrac{1}{2}x^2 + x^3$ for $0 \leq x \leq 0.81$.)

The inequality (9.52) follows from the following fact, proved in exercise 9.18. If
$\lambda(x) < 1$, and $x^+ = x - \nabla^2 f(x)^{-1} \nabla f(x)$, then

$$\lambda(x^+) \leq \frac{\lambda(x)^2}{(1 - \lambda(x))^2}. \tag{9.55}$$

In particular, if $\lambda(x) \leq 1/4$,

$$\lambda(x^+) \leq 2\lambda(x)^2,$$

which proves that (9.52) holds when $\lambda(x^{(k)}) \leq \eta$.
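
The scalar inequalities used in this argument can be spot-checked on a grid. The sketch below is ours; the function names are hypothetical and $\alpha = 0.1$ is only an example value.

```python
import math


def eta(alpha):
    """Threshold eta = (1 - 2*alpha)/4 for the quadratically convergent phase."""
    return (1.0 - 2.0 * alpha) / 4.0


def unit_step_bound(lam):
    """Bound (9.54) on ftilde(1) - ftilde(0) at t = 1, for 0 <= lam < 1."""
    return -lam**2 - lam - math.log1p(-lam)


def next_decrement_bound(lam):
    """Righthand side of (9.55)."""
    return lam**2 / (1.0 - lam) ** 2


if __name__ == "__main__":
    alpha = 0.1
    lam = eta(alpha)
    # Sufficient decrease at the unit step, and the factor-2 bound on lambda(x+).
    print(unit_step_bound(lam) <= -alpha * lam**2)
    print(next_decrement_bound(lam) <= 2.0 * lam**2)
```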

The  final  complexity  bound 

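Putting it all together, the bound (9.53) on the number of Newton iterations becomes

$$\frac{f(x^{(0)}) - p^\star}{\gamma} + \log_2 \log_2(1/\epsilon) = \frac{20 - 8\alpha}{\alpha\beta(1 - 2\alpha)^2}\left(f(x^{(0)}) - p^\star\right) + \log_2 \log_2(1/\epsilon). \tag{9.56}$$

This expression depends only on the initial suboptimality $f(x^{(0)}) - p^\star$, the line
search parameters $\alpha$ and $\beta$, and the final accuracy $\epsilon$. Moreover the term involving
$\epsilon$ can be safely replaced by the constant six, so the constants in the bound really
depend only on $\alpha$ and $\beta$. For typical values of $\alpha$ and $\beta$, the constant that scales
$f(x^{(0)}) - p^\star$ is on the order of several hundred. For example, with $\alpha = 0.1$,
$\beta = 0.8$, the scaling factor is 375. With tolerance $\epsilon = 10^{-10}$, we obtain the bound

$$375\left(f(x^{(0)}) - p^\star\right) + 6. \tag{9.57}$$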
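
The constants in (9.56) and (9.57) follow from $\eta = (1 - 2\alpha)/4$ and $\gamma = \alpha\beta\eta^2/(1 + \eta)$. As a worked check (our own sketch, with hypothetical function names), the arithmetic can be reproduced as follows.

```python
import math


def scaling_constant(alpha, beta):
    """The factor (20 - 8*alpha) / (alpha*beta*(1 - 2*alpha)^2) in (9.56)."""
    return (20.0 - 8.0 * alpha) / (alpha * beta * (1.0 - 2.0 * alpha) ** 2)


def newton_iteration_bound(f0_minus_pstar, alpha, beta, eps):
    """Bound (9.53) with eta = (1 - 2*alpha)/4 and gamma = alpha*beta*eta^2/(1 + eta)."""
    eta = (1.0 - 2.0 * alpha) / 4.0
    gamma = alpha * beta * eta**2 / (1.0 + eta)
    return f0_minus_pstar / gamma + math.log2(math.log2(1.0 / eps))


if __name__ == "__main__":
    print(scaling_constant(0.1, 0.8))                   # the factor 375
    print(math.log2(math.log2(1e10)))                   # about 5.05, below six
    print(newton_iteration_bound(10.0, 0.1, 0.8, 1e-10))
```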

We will see that this bound is fairly conservative, but does capture what appears
to be the general form of the worst-case number of Newton steps required. A more
refined analysis, such as the one originally given by Nesterov and Nemirovski, gives
a similar bound, with a substantially smaller constant scaling $f(x^{(0)}) - p^\star$.

Figure 9.25 Number of Newton iterations required to minimize self-concordant
functions versus $f(x^{(0)}) - p^\star$. The function $f$ has the form
$f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x)$, where the problem data $a_i$ and $b$ are
randomly generated. The circles show problems with $m = 100$, $n = 50$; the
squares show problems with $m = 1000$, $n = 500$; and the diamonds show
problems with $m = 1000$, $n = 50$. Fifty instances of each are shown.


9.6.5  Discussion  and  numerical  examples 

A  family  of  self-concordant  functions 

It is interesting to compare the upper bound (9.57) with the actual number of
iterations required to minimize a self-concordant function. We consider a family of
problems of the form

$$f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x).$$

The problem data $a_i$ and $b$ were generated as follows. For each problem instance,
the coefficients of $a_i$ were generated from independent normal distributions with
mean zero and unit variance, and the coefficients $b$ were generated from a uniform
distribution on $[0, 1]$. Problem instances which were unbounded below were
discarded. For each problem we first compute $x^\star$. We then generate a starting point
by choosing a random direction $v$, and taking $x^{(0)} = x^\star + sv$, where $s$ is chosen so
that $f(x^{(0)}) - p^\star$ has a prescribed value between 0 and 35. (We should point out
that starting points with values $f(x^{(0)}) - p^\star = 10$ or higher are actually very close
to the boundary of the polyhedron.) We then minimize the function using Newton’s
method with a backtracking line search with parameters $\alpha = 0.1$, $\beta = 0.8$,
and tolerance $\epsilon = 10^{-10}$.

Figure 9.25 shows the number of Newton iterations required versus $f(x^{(0)}) - p^\star$
for 150 problem instances. The circles show 50 problems with $m = 100$, $n = 50$;
the squares show 50 problems with $m = 1000$, $n = 500$; and the diamonds show 50
problems with $m = 1000$, $n = 50$.




For the values of the backtracking parameters used, the complexity bound found
above is

$$375\left(f(x^{(0)}) - p^\star\right) + 6, \tag{9.58}$$

clearly a much larger value than the number of iterations required (for these 150
instances). The plot suggests that there is a valid bound of the same form, but
with a much smaller constant (say, around 1.5) scaling $f(x^{(0)}) - p^\star$. Indeed, the
expression

$$f(x^{(0)}) - p^\star + 6$$

is not a bad gross predictor of the number of Newton steps required, although it is
clearly not the only factor. First, there are plenty of problem instances where the
number of Newton steps is somewhat smaller, which correspond, we can guess, to
‘lucky’ starting points. Note also that for the larger problems, with 500 variables
(represented by the squares), there seem to be even more cases where the number
of Newton steps is unusually small.
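
To get a feel for how loose the bound is, one can run Newton’s method on a tiny instance and compare the iteration count with $375(f(x^{(0)}) - p^\star) + 6$. The sketch below is ours and is not the experiment of figure 9.25: it uses a single scalar variable, hand-picked data, hypothetical function names, and the stopping criterion $\lambda(x)^2 \leq \epsilon$ from §9.6.3.

```python
import math


def make_barrier(a, b):
    """f(x) = -sum_i log(b_i - a_i x) for scalar x, with its first two derivatives."""
    def f(x):
        s = [bi - ai * x for ai, bi in zip(a, b)]
        return math.inf if min(s) <= 0.0 else -sum(math.log(si) for si in s)

    def d1(x):
        return sum(ai / (bi - ai * x) for ai, bi in zip(a, b))

    def d2(x):
        return sum((ai / (bi - ai * x)) ** 2 for ai, bi in zip(a, b))

    return f, d1, d2


def newton(f, d1, d2, x, alpha=0.1, beta=0.8, eps=1e-10, max_iter=1000):
    """Newton's method with backtracking; returns (x, iterations used)."""
    for k in range(max_iter):
        g, h = d1(x), d2(x)
        lam2 = g * g / h                  # squared Newton decrement
        if lam2 <= eps:
            return x, k
        dx = -g / h                       # Newton step
        t = 1.0
        # f returns +inf outside dom f, so infeasible trial points are rejected.
        while f(x + t * dx) > f(x) - alpha * t * lam2:
            t *= beta
        x += t * dx
    raise RuntimeError("Newton's method did not converge")


if __name__ == "__main__":
    f, d1, d2 = make_barrier([1.0, -1.0, 2.0, -0.5], [1.0, 1.0, 3.0, 2.0])
    x_star, _ = newton(f, d1, d2, 0.0)
    x0 = 0.999                            # very close to the boundary x < 1
    _, iters = newton(f, d1, d2, x0)
    print(iters, 375.0 * (f(x0) - f(x_star)) + 6.0)
```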

We should mention here that the problem family we study is not just
self-concordant, but in fact minimally self-concordant, by which we mean that $\alpha f$
is not self-concordant for $\alpha < 1$. Hence, the bound (9.58) cannot be improved
by simply scaling $f$. (The function $f(x) = -20 \log x$ is an example of a
self-concordant function which is not minimally self-concordant, since $(1/20)f$ is also
self-concordant.)

Practical  importance  of  self-concordance 

We have already observed that Newton’s method works in general very well for
strongly convex objective functions. We can justify this vague statement empirically,
and also using the classical analysis of Newton’s method, which yields a
complexity bound, but one that depends on several constants that are almost always
unknown.

For self-concordant functions we can say somewhat more. We have a complexity
bound that is completely explicit, and does not depend on any unknown constants.
Empirical studies suggest that this bound can be tightened considerably, but its
general form, a small constant plus a multiple of $f(x^{(0)}) - p^\star$, seems to predict, at
least crudely, the number of Newton steps required to minimize an approximately
minimally self-concordant function.

It is not yet clear whether self-concordant functions are in practice more easily
minimized by Newton’s method than non-self-concordant functions. (It is not
even clear how one would make this statement precise.) At the moment, we can
say that self-concordant functions are a class of functions for which we can say
considerably more about the complexity of Newton’s method than is the case for
non-self-concordant functions.

9.7 Implementation

In this section we discuss some of the issues that arise in implementing an
unconstrained minimization algorithm. We refer the reader to appendix C for more
details on numerical linear algebra.


9.7.1  Pre-computation  for  line  searches 

In the simplest implementation of a line search, $f(x + t\Delta x)$ is evaluated for each
value of $t$ in the same way that $f(z)$ is evaluated for any $z \in \operatorname{dom} f$. But in some
cases we can exploit the fact that $f$ (and its derivatives, in an exact line search) are
to be evaluated at many points along the ray $\{x + t\Delta x \mid t \geq 0\}$ to reduce the total
computational effort. This usually requires some pre-computation, which is often
on the same order as computing $f$ at any point, after which $f$ (and its derivatives)
can be computed more efficiently along the ray.

Suppose that $x \in \operatorname{dom} f$ and $\Delta x \in \mathbf{R}^n$, and define $\tilde f$ as $f$ restricted to the line
or ray determined by $x$ and $\Delta x$, i.e., $\tilde f(t) = f(x + t\Delta x)$. In a backtracking line
search we must evaluate $\tilde f$ for several, and possibly many, values of $t$; in an exact
line search method we must evaluate $\tilde f$ and one or more derivatives at a number of
values of $t$. In the simple method described above, we evaluate $\tilde f(t)$ by first forming
$z = x + t\Delta x$, and then evaluating $f(z)$. To evaluate $\tilde f'(t)$, we form $z = x + t\Delta x$,
then evaluate $\nabla f(z)$, and then compute $\tilde f'(t) = \nabla f(z)^T \Delta x$. In some representative
examples below we show how $\tilde f$ can be computed at a number of values of $t$ more
efficiently.

Composition  with  an  affine  function 

A very general case in which pre-computation can speed up the line search process
occurs when the objective has the form $f(x) = \phi(Ax + b)$, where $A \in \mathbf{R}^{p \times n}$, and $\phi$
is easy to evaluate (for example, separable). To evaluate $\tilde f(t) = f(x + t\Delta x)$ for $k$
values of $t$ using the simple approach, we form $A(x + t\Delta x) + b$ for each value of $t$
(which costs $2kpn$ flops), and then evaluate $\phi(A(x + t\Delta x) + b)$ for each value of $t$.
This can be done more efficiently by first computing $Ax + b$ and $A\Delta x$ ($4pn$ flops),
then forming $A(x + t\Delta x) + b$ for each value of $t$ using

$$A(x + t\Delta x) + b = (Ax + b) + t(A\Delta x),$$

which costs $2kp$ flops. The total cost, keeping only the dominant terms, is $4pn + 2kp$
flops, compared to $2kpn$ for the simple method.
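
The two ways of evaluating $\tilde f$ along the ray can be written side by side. This is a minimal sketch of our own in plain Python: the function names are hypothetical, and the separable choice $\phi(y) = \sum_i e^{y_i}$ is an assumption made only so that the example runs.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product for A given as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]


def phi(y):
    """Example separable phi (an assumption): sum_i exp(y_i)."""
    return sum(math.exp(yi) for yi in y)


def line_values_simple(A, b, x, dx, ts):
    """Form A(x + t dx) + b from scratch for every t: about 2*k*p*n flops."""
    vals = []
    for t in ts:
        z = [xj + t * dxj for xj, dxj in zip(x, dx)]
        vals.append(phi([yi + bi for yi, bi in zip(matvec(A, z), b)]))
    return vals


def line_values_precomputed(A, b, x, dx, ts):
    """Compute Ax + b and A dx once (about 4*p*n flops), then 2*p flops per t."""
    y0 = [yi + bi for yi, bi in zip(matvec(A, x), b)]
    dy = matvec(A, dx)
    return [phi([y0i + t * dyi for y0i, dyi in zip(y0, dy)]) for t in ts]


if __name__ == "__main__":
    A = [[1.0, 2.0, -1.0], [0.5, -1.0, 0.0]]
    b = [0.1, -0.2]
    x, dx = [0.3, -0.4, 1.0], [1.0, 0.5, -2.0]
    ts = [0.8**j for j in range(4)]       # trial step sizes of a backtracking search
    print(line_values_simple(A, b, x, dx, ts))
    print(line_values_precomputed(A, b, x, dx, ts))
```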

Analytic  center  of  a  linear  matrix  inequality 

Here we give an example that is more specific, and more complete. We consider
the problem (9.6) of computing the analytic center of a linear matrix inequality,
i.e., minimizing $\log\det F(x)^{-1}$, where $x \in \mathbf{R}^n$ and $F : \mathbf{R}^n \to \mathbf{S}^p$ is affine. Along
the line through $x$ with direction $\Delta x$ we have

$$\tilde f(t) = \log\det\left(F(x + t\Delta x)\right)^{-1} = -\log\det(A + tB)$$

where

$$A = F(x), \qquad B = \Delta x_1 F_1 + \cdots + \Delta x_n F_n \in \mathbf{S}^p.$$

Since $A \succ 0$, it has a Cholesky factorization $A = LL^T$, where $L$ is lower triangular
and nonsingular. Therefore we can express $\tilde f$ as

$$\tilde f(t) = -\log\det\left(L(I + tL^{-1}BL^{-T})L^T\right) = -\log\det A - \sum_{i=1}^p \log(1 + t\lambda_i) \tag{9.59}$$

where $\lambda_1, \ldots, \lambda_p$ are the eigenvalues of $L^{-1}BL^{-T}$. Once these eigenvalues are
computed, we can evaluate $\tilde f(t)$, for any $t$, with $4p$ simple arithmetic computations,
by using the formula on the righthand side of (9.59). We can evaluate $\tilde f'(t)$ (and
similarly, any higher derivative) in $4p$ operations, using the formula

$$\tilde f'(t) = -\sum_{i=1}^p \frac{\lambda_i}{1 + t\lambda_i}.$$

Let us compare the two methods for carrying out a line search, assuming that
we need to evaluate $f(x + t\Delta x)$ for $k$ values of $t$. In the simple method, for each
value of $t$ we form $F(x + t\Delta x)$, and then evaluate $f(x + t\Delta x)$ as $-\log\det F(x + t\Delta x)$.
For example, we can find the Cholesky factorization of $F(x + t\Delta x) = LL^T$, and
then evaluate

$$-\log\det F(x + t\Delta x) = -2\sum_{i=1}^p \log L_{ii}.$$

The cost is $np^2$ to form $F(x + t\Delta x)$, plus $(1/3)p^3$ for the Cholesky factorization.
Therefore the total cost of the line search is

$$k\left(np^2 + (1/3)p^3\right) = knp^2 + (1/3)kp^3.$$

Using the method outlined above, we first form $A$, which costs $np^2$, and factor
it, which costs $(1/3)p^3$. We also form $B$ (which costs $np^2$), and $L^{-1}BL^{-T}$, which
costs $2p^3$. The eigenvalues of this matrix are then computed, at a cost of about
$(4/3)p^3$ flops. This pre-computation requires a total of $2np^2 + (11/3)p^3$ flops. After
finishing this pre-computation, we can now evaluate $\tilde f(t)$ for each value of $t$ at a
cost of $4p$ flops. The total cost is then

$$2np^2 + (11/3)p^3 + 4kp.$$

Assuming $k$ is small compared to $p(2n + (11/3)p)$, this means the entire line search
can be carried out at an effort comparable to simply evaluating $f$. Depending on
the values of $k$, $p$, and $n$, the savings over the simple method can be as large as
order $k$.
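
The two flop counts are easy to compare for concrete sizes. The sketch below only encodes the formulas above; the function names are hypothetical and the sizes in the usage example are arbitrary.

```python
def simple_cost(n, p, k):
    """k evaluations, each forming F(x + t dx) and factoring it."""
    return k * (n * p**2 + p**3 / 3.0)


def precomputed_cost(n, p, k):
    """One-time pre-computation plus 4p flops per evaluation."""
    return 2 * n * p**2 + (11.0 / 3.0) * p**3 + 4 * k * p


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, simple_cost(100, 100, k), precomputed_cost(100, 100, k))
```

With $n = p = 100$ the pre-computation does not pay off for a single evaluation ($k = 1$), but it is several times cheaper once $k$ reaches 20, in line with the remark that the savings depend on $k$, $p$, and $n$.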


9.7.2  Computing  the  Newton  step 

In this section we briefly describe some of the issues that arise in implementing
Newton’s method. In most cases, the work of computing the Newton step $\Delta x_{\mathrm{nt}}$
dominates the work involved in the line search. To compute the Newton step
$\Delta x_{\mathrm{nt}}$, we first evaluate and form the Hessian matrix $H = \nabla^2 f(x)$ and the gradient
$g = \nabla f(x)$ at $x$. Then we solve the system of linear equations $H\Delta x_{\mathrm{nt}} = -g$ to
find the Newton step. This set of equations is sometimes called the Newton system
(since its solution gives the Newton step) or the normal equations, since the same
type of equation arises in solving a least-squares problem (see §9.1.1).

While a general linear equation solver can be used, it is better to use methods
that take advantage of the symmetry and positive definiteness of $H$. The most
common approach is to form the Cholesky factorization of $H$, i.e., to compute a
lower triangular matrix $L$ that satisfies $LL^T = H$ (see §C.3.2). We then solve
$Lw = -g$ by forward substitution, to obtain $w = -L^{-1}g$, and then solve
$L^T \Delta x_{\mathrm{nt}} = w$ by back substitution, to obtain

$$\Delta x_{\mathrm{nt}} = L^{-T}w = -L^{-T}L^{-1}g = -H^{-1}g.$$

We can compute the Newton decrement as $\lambda^2 = -\Delta x_{\mathrm{nt}}^T g$, or use the formula

$$\lambda^2 = g^T H^{-1} g = \|L^{-1}g\|_2^2 = \|w\|_2^2.$$

If a dense (unstructured) Cholesky factorization is used, the cost of the forward and
back substitution is dominated by the cost of the Cholesky factorization, which is
$(1/3)n^3$ flops. The total cost of computing the Newton step $\Delta x_{\mathrm{nt}}$ is thus $F + (1/3)n^3$
flops, where $F$ is the cost of forming $H$ and $g$.
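
The procedure just described (factor, forward substitution, back substitution, then $\lambda^2 = \|w\|_2^2$) can be sketched in a few lines. This is our own unoptimized illustration in plain Python with hypothetical function names; a practical implementation would call a numerical linear algebra library instead.

```python
import math


def cholesky(H):
    """Return lower triangular L with L L^T = H (H symmetric positive definite)."""
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = H[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            L[i][j] = (H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L


def newton_step(H, g):
    """Solve H dx = -g via Cholesky; also return lambda^2 = ||w||_2^2."""
    n = len(g)
    L = cholesky(H)
    w = [0.0] * n
    for i in range(n):                    # forward substitution: L w = -g
        w[i] = (-g[i] - sum(L[i][k] * w[k] for k in range(i))) / L[i][i]
    dx = [0.0] * n
    for i in reversed(range(n)):          # back substitution: L^T dx = w
        dx[i] = (w[i] - sum(L[k][i] * dx[k] for k in range(i + 1, n))) / L[i][i]
    return dx, sum(wi * wi for wi in w)


if __name__ == "__main__":
    H = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    g = [1.0, -2.0, 0.5]
    dx, lam2 = newton_step(H, g)
    print(dx, lam2, -sum(gi * di for gi, di in zip(g, dx)))
```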

It is often possible to solve the Newton system $H\Delta x_{\mathrm{nt}} = -g$ more efficiently,
by exploiting special structure in $H$, such as band structure or sparsity. In this
context, ‘structure of $H$’ means structure that is the same for all $x$. For example,
when we say that ‘$H$ is tridiagonal’ we mean that for every $x \in \operatorname{dom} f$, $\nabla^2 f(x)$ is
tridiagonal.

Band  structure 

If the Hessian $H$ is banded with bandwidth $k$, i.e., $H_{ij} = 0$ for $|i - j| > k$, then the
banded Cholesky factorization can be used, as well as banded forward and back
substitutions. The cost of computing the Newton step $\Delta x_{\mathrm{nt}} = -H^{-1}g$ is then
$F + nk^2$ flops (assuming $k \ll n$), compared to $F + (1/3)n^3$ for a dense factorization
and substitution method.

The Hessian band structure condition

$$\nabla^2 f(x)_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j} = 0 \quad \text{for } |i - j| > k,$$

for all $x \in \operatorname{dom} f$, has an interesting interpretation in terms of the objective
function $f$. Roughly speaking it means that in the objective function, each variable
$x_i$ couples nonlinearly only to the $2k + 1$ variables $x_j$, $j = i - k, \ldots, i + k$. This
occurs when $f$ has the partial separability form

$$f(x) = \psi_1(x_1, \ldots, x_{k+1}) + \psi_2(x_2, \ldots, x_{k+2}) + \cdots + \psi_{n-k}(x_{n-k}, \ldots, x_n),$$

where $\psi_i : \mathbf{R}^{k+1} \to \mathbf{R}$. In other words, $f$ can be expressed as a sum of functions
of $k + 1$ consecutive variables.

Example 9.9 Consider the problem of minimizing $f : \mathbf{R}^n \to \mathbf{R}$, which has the form

$$f(x) = \psi_1(x_1, x_2) + \psi_2(x_2, x_3) + \cdots + \psi_{n-1}(x_{n-1}, x_n),$$

where $\psi_i : \mathbf{R}^2 \to \mathbf{R}$ are convex and twice differentiable. Because of this form, the
Hessian $\nabla^2 f$ is tridiagonal, since $\partial^2 f/\partial x_i \partial x_j = 0$ for $|i - j| > 1$. (And conversely, if
the Hessian of a function is tridiagonal for all $x$, then it has this form.)

Using Cholesky factorization and forward and back substitution algorithms for
tridiagonal matrices, we can solve the Newton system for this problem in order $n$ flops.
This should be compared to order $n^3$ flops, if the special form of $f$ were not exploited.
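
For the tridiagonal case of example 9.9, the Cholesky factor is bidiagonal, and the factorization and both substitutions take a number of operations proportional to $n$. The following is a minimal sketch of our own (hypothetical function name, plain Python lists), assuming the tridiagonal matrix is symmetric positive definite.

```python
import math


def solve_tridiagonal_spd(diag, off, rhs):
    """Solve H x = rhs for symmetric positive definite tridiagonal H in O(n) flops.

    diag[i] = H[i][i] and off[i] = H[i+1][i] = H[i][i+1].
    """
    n = len(diag)
    l_diag = [0.0] * n                    # diagonal of the Cholesky factor L
    l_off = [0.0] * (n - 1)               # subdiagonal of L
    l_diag[0] = math.sqrt(diag[0])
    for i in range(1, n):
        l_off[i - 1] = off[i - 1] / l_diag[i - 1]
        l_diag[i] = math.sqrt(diag[i] - l_off[i - 1] ** 2)
    w = [0.0] * n                         # forward substitution: L w = rhs
    w[0] = rhs[0] / l_diag[0]
    for i in range(1, n):
        w[i] = (rhs[i] - l_off[i - 1] * w[i - 1]) / l_diag[i]
    x = [0.0] * n                         # back substitution: L^T x = w
    x[n - 1] = w[n - 1] / l_diag[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (w[i] - l_off[i] * x[i + 1]) / l_diag[i]
    return x


if __name__ == "__main__":
    g = [1.0, 0.0, -1.0, 2.0, 0.5]
    # Newton step for a tridiagonal Hessian: solve H dx = -g.
    print(solve_tridiagonal_spd([2.0] * 5, [-1.0] * 4, [-gi for gi in g]))
```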


Sparse  structure 

More generally we can exploit sparsity of the Hessian $H$ in solving the Newton
system. This sparse structure occurs whenever each variable $x_i$ is nonlinearly
coupled (in the objective) to only a few other variables, or equivalently, when the
objective function can be expressed as a sum of functions, each depending on only
a few variables, and each variable appearing in only a few of these functions.

To solve $H\Delta x = -g$ when $H$ is sparse, a sparse Cholesky factorization is used
to compute a permutation matrix $P$ and lower triangular matrix $L$ for which

$$H = PLL^TP^T.$$

The cost of this factorization depends on the particular sparsity pattern, but is
often far smaller than $(1/3)n^3$, and an empirical complexity of order $n$ (for large
$n$) is not uncommon. The forward and back substitution are very similar to the
basic method without the permutation. We solve $Lw = -P^Tg$ using forward
substitution, and then solve $L^Tv = w$ by back substitution to obtain

$$v = L^{-T}w = -L^{-T}L^{-1}P^Tg.$$

The Newton step is then $\Delta x = Pv$.

Since the sparsity pattern of $H$ does not change as $x$ varies (or more precisely,
since we only exploit sparsity that does not change with $x$) we can use the same
permutation matrix $P$ for each of the Newton steps. The step of determining a
good permutation matrix $P$, which is called the symbolic factorization step, can be
done once, for the whole Newton process.

Diagonal  plus  low  rank 

There are many other types of structure that can be exploited in solving the Newton
system $H\Delta x_{\mathrm{nt}} = -g$. Here we briefly describe one, and refer the reader to
appendix C for more details. Suppose the Hessian $H$ can be expressed as a diagonal
matrix plus one of low rank, say, $p$. This occurs when the objective function $f$
has the special form

$$f(x) = \sum_{i=1}^n \psi_i(x_i) + \psi_0(Ax + b) \tag{9.60}$$

where $A \in \mathbf{R}^{p \times n}$, $\psi_1, \ldots, \psi_n : \mathbf{R} \to \mathbf{R}$, and $\psi_0 : \mathbf{R}^p \to \mathbf{R}$. In other words, $f$
is a separable function, plus a function that depends on a low dimensional affine
function of $x$.

To find the Newton step $\Delta x_{\mathrm{nt}}$ for (9.60) we must solve the Newton system
$H\Delta x_{\mathrm{nt}} = -g$, with

$$H = D + A^T H_0 A.$$

Here $D = \operatorname{diag}(\psi_1''(x_1), \ldots, \psi_n''(x_n))$ is diagonal, and $H_0 = \nabla^2 \psi_0(Ax + b)$ is the
Hessian of $\psi_0$. If we compute the Newton step without exploiting the structure,
the cost of solving the Newton system is $(1/3)n^3$ flops.

Let $H_0 = L_0 L_0^T$ be the Cholesky factorization of $H_0$. We introduce the temporary
variable $w = L_0^T A \Delta x_{\mathrm{nt}} \in \mathbf{R}^p$, and express the Newton system as

$$D\Delta x_{\mathrm{nt}} + A^T L_0 w = -g, \qquad w = L_0^T A \Delta x_{\mathrm{nt}}.$$

Substituting $\Delta x_{\mathrm{nt}} = -D^{-1}(A^T L_0 w + g)$ (from the first equation) into the second
equation, we obtain

$$(I + L_0^T A D^{-1} A^T L_0) w = -L_0^T A D^{-1} g, \tag{9.61}$$

which is a system of $p$ linear equations.

Now we proceed as follows to compute the Newton step $\Delta x_{\mathrm{nt}}$.
First we compute the Cholesky factorization of $H_0$, which costs $(1/3)p^3$
flops. We then form the dense, positive definite symmetric matrix appearing on
the lefthand side of (9.61), which costs $2p^2 n$ flops. We then solve (9.61)
for $w$ using a Cholesky factorization and a back and forward substitution,
which costs $(1/3)p^3$ flops. Finally, we compute $\Delta x_{\mathrm{nt}}$ using
$\Delta x_{\mathrm{nt}} = -D^{-1}(A^T L_0 w + g)$, which costs $2np$ flops. The
total cost of computing $\Delta x_{\mathrm{nt}}$ is (keeping only the dominant
term) $2p^2 n$ flops, which is far smaller than $(1/3)n^3$ for $p \ll n$.
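
The procedure above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the diagonal of $D$ is passed as a vector `d`, the function name and test data are hypothetical, and `np.linalg.solve` stands in for the Cholesky-based solve of the $p \times p$ system (9.61).

```python
import numpy as np

def newton_step_diag_plus_low_rank(d, A, H0, g):
    """Solve (diag(d) + A^T H0 A) dx = -g via the p x p system (9.61)."""
    p = A.shape[0]
    L0 = np.linalg.cholesky(H0)             # H0 = L0 L0^T
    B = L0.T @ A                            # p x n matrix L0^T A
    M = np.eye(p) + (B / d) @ B.T           # I + L0^T A D^{-1} A^T L0
    w = np.linalg.solve(M, -(B / d) @ g)    # solve (9.61) for w
    return -(B.T @ w + g) / d               # dx = -D^{-1} (A^T L0 w + g)

rng = np.random.default_rng(0)
n, p = 50, 3
d = rng.uniform(1.0, 2.0, n)                # positive diagonal of D
A = rng.standard_normal((p, n))
C = rng.standard_normal((p, p))
H0 = C @ C.T + np.eye(p)                    # positive definite H0
g = rng.standard_normal(n)
dx = newton_step_diag_plus_low_rank(d, A, H0, g)
```

The $n \times n$ matrix $H$ is never formed; the only dense factorizations involve $p \times p$ matrices.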

Exercises


Unconstrained  minimization 


9.1 Minimizing a quadratic function. Consider the problem of minimizing a quadratic
function:

$$\text{minimize} \quad f(x) = (1/2) x^T P x + q^T x + r,$$

where $P \in \mathbf{S}^n$ (but we do not assume $P \succeq 0$).

(a) Show that if $P \not\succeq 0$, i.e., the objective function $f$ is not convex, then the
problem is unbounded below.

(b) Now suppose that $P \succeq 0$ (so the objective function is convex), but the optimality
condition $P x^\star = -q$ does not have a solution. Show that the problem is unbounded
below.

9.2 Minimizing a quadratic-over-linear fractional function. Consider the problem of
minimizing the function $f : \mathbf{R}^n \to \mathbf{R}$, defined as

$$f(x) = \frac{\|Ax - b\|_2^2}{c^T x + d}, \qquad \operatorname{dom} f = \{x \mid c^T x + d > 0\}.$$

We assume $\operatorname{rank} A = n$ and $b \notin \mathcal{R}(A)$.

(a) Show that $f$ is closed.

(b) Show that the minimizer $x^\star$ of $f$ is given by

$$x^\star = x_1 + t x_2$$

where $x_1 = (A^T A)^{-1} A^T b$, $x_2 = (A^T A)^{-1} c$, and $t \in \mathbf{R}$ can be calculated
by solving a quadratic equation.

9.3 Initial point and sublevel set condition. Consider the function $f(x) = x_1^2 + x_2^2$ with
domain $\operatorname{dom} f = \{(x_1, x_2) \mid x_1 > 1\}$.

(a) What is $p^\star$?

(b) Draw the sublevel set $S = \{x \mid f(x) \leq f(x^{(0)})\}$ for $x^{(0)} = (2, 2)$. Is the sublevel
set $S$ closed? Is $f$ strongly convex on $S$?

(c) What happens if we apply the gradient method with backtracking line search, starting
at $x^{(0)}$? Does $f(x^{(k)})$ converge to $p^\star$?

9.4 Do you agree with the following argument? The $\ell_1$-norm of a vector $x \in \mathbf{R}^m$ can be
expressed as

$$\|x\|_1 = (1/2) \inf_{y \succ 0} \left( \sum_{i=1}^m x_i^2 / y_i + \mathbf{1}^T y \right).$$

Therefore the $\ell_1$-norm approximation problem

$$\text{minimize} \quad \|Ax - b\|_1$$

is equivalent to the minimization problem

$$\text{minimize} \quad f(x, y) = \sum_{i=1}^m (a_i^T x - b_i)^2 / y_i + \mathbf{1}^T y, \tag{9.62}$$

with $\operatorname{dom} f = \{(x, y) \in \mathbf{R}^n \times \mathbf{R}^m \mid y \succ 0\}$, where $a_i^T$ is the $i$th row of $A$.
Since $f$ is twice differentiable and convex, we can solve the $\ell_1$-norm approximation
problem by applying Newton’s method to (9.62).

9.5 Backtracking line search. Suppose $f$ is strongly convex with $mI \preceq \nabla^2 f(x) \preceq MI$. Let
$\Delta x$ be a descent direction at $x$. Show that the backtracking stopping condition holds for

$$0 < t \leq -\frac{\nabla f(x)^T \Delta x}{M \|\Delta x\|_2^2}.$$

Use this to give an upper bound on the number of backtracking iterations.

Gradient and steepest descent methods

9.6 Quadratic problem in $\mathbf{R}^2$. Verify the expressions for the iterates $x^{(k)}$ in the first
example of §9.3.2.

9.7 Let $\Delta x_{\mathrm{nsd}}$ and $\Delta x_{\mathrm{sd}}$ be the normalized and unnormalized steepest descent
directions at $x$, for the norm $\|\cdot\|$. Prove the following identities.

(a) $\nabla f(x)^T \Delta x_{\mathrm{nsd}} = -\|\nabla f(x)\|_*$.

(b) $\nabla f(x)^T \Delta x_{\mathrm{sd}} = -\|\nabla f(x)\|_*^2$.

(c) $\Delta x_{\mathrm{sd}} = \operatorname{argmin}_v \left( \nabla f(x)^T v + (1/2)\|v\|^2 \right)$.

9.8 Steepest descent method in $\ell_\infty$-norm. Explain how to find a steepest descent direction
in the $\ell_\infty$-norm, and give a simple interpretation.

Newton’s  method 


9.9 Newton decrement. Show that the Newton decrement $\lambda(x)$ satisfies

$$\lambda(x) = \sup_{v^T \nabla^2 f(x) v = 1} \left( -v^T \nabla f(x) \right) = \sup_{v \neq 0} \frac{-v^T \nabla f(x)}{(v^T \nabla^2 f(x) v)^{1/2}}.$$

9.10 The pure Newton method. Newton’s method with fixed step size $t = 1$ can diverge if the
initial point is not close to $x^\star$. In this problem we consider two examples.

(a) $f(x) = \log(e^x + e^{-x})$ has a unique minimizer $x^\star = 0$. Run Newton’s method with
fixed step size $t = 1$, starting at $x^{(0)} = 1$ and at $x^{(0)} = 1.1$.

(b) $f(x) = -\log x + x$ has a unique minimizer $x^\star = 1$. Run Newton’s method with fixed
step size $t = 1$, starting at $x^{(0)} = 3$.

Plot $f$ and $f'$, and show the first few iterates.
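
A minimal sketch of the pure Newton iteration for these two examples is given below. The helper `pure_newton` and the number of iterations are our own choices, not part of the exercise; the derivative formulas are stated in the comments. In this sketch the iterates started at $1$ approach $0$, those started at $1.1$ grow in magnitude, and in part (b) the first iterate is $-3$, which lies outside $\operatorname{dom} f$.

```python
import math

def pure_newton(fprime, fsecond, x0, num_iters):
    """Newton's method with fixed step size t = 1 for a function of one variable."""
    xs = [x0]
    for _ in range(num_iters):
        x = xs[-1]
        xs.append(x - fprime(x) / fsecond(x))
    return xs

# Part (a): f(x) = log(exp(x) + exp(-x)), so f'(x) = tanh(x), f''(x) = 1/cosh(x)^2.
fp = math.tanh
fpp = lambda x: 1.0 / math.cosh(x) ** 2
iters_a1 = pure_newton(fp, fpp, 1.0, 5)     # starts at x = 1
iters_a2 = pure_newton(fp, fpp, 1.1, 5)     # starts at x = 1.1

# Part (b): f(x) = -log(x) + x, so f'(x) = 1 - 1/x, f''(x) = 1/x^2.
iters_b = pure_newton(lambda x: 1.0 - 1.0 / x, lambda x: 1.0 / x ** 2, 3.0, 1)
```

The iteration count is kept small on purpose: from $x^{(0)} = 1.1$ the iterates grow so quickly that `math.cosh` overflows after a few more steps.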

9.11 Gradient and Newton methods for composition functions. Suppose $\phi : \mathbf{R} \to \mathbf{R}$ is
increasing and convex, and $f : \mathbf{R}^n \to \mathbf{R}$ is convex, so $g(x) = \phi(f(x))$ is convex. (We
assume that $f$ and $g$ are twice differentiable.) The problems of minimizing $f$ and
minimizing $g$ are clearly equivalent.

Compare the gradient method and Newton’s method, applied to $f$ and $g$. How are the
search directions related? How are the methods related if an exact line search is used?
Hint. Use the matrix inversion lemma (see §C.4.3).

9.12 Trust region Newton method. If $\nabla^2 f(x)$ is singular (or very ill-conditioned), the Newton
step $\Delta x_{\mathrm{nt}} = -\nabla^2 f(x)^{-1} \nabla f(x)$ is not well defined. Instead we can define a search
direction $\Delta x_{\mathrm{tr}}$ as the solution of

$$\begin{array}{ll} \text{minimize} & (1/2) v^T H v + g^T v \\ \text{subject to} & \|v\|_2 \leq \gamma, \end{array}$$

where $H = \nabla^2 f(x)$, $g = \nabla f(x)$, and $\gamma$ is a positive constant. The point $x + \Delta x_{\mathrm{tr}}$
minimizes the second-order approximation of $f$ at $x$, subject to the constraint that
$\|(x + \Delta x_{\mathrm{tr}}) - x\|_2 \leq \gamma$. The set $\{v \mid \|v\|_2 \leq \gamma\}$ is called the trust region. The
parameter $\gamma$, the size of the trust region, reflects our confidence in the second-order model.

Show that $\Delta x_{\mathrm{tr}}$ minimizes

$$(1/2) v^T H v + g^T v + \hat\beta \|v\|_2^2,$$

for some $\hat\beta$. This quadratic function can be interpreted as a regularized quadratic model
for $f$ around $x$.

Self-concordance


9.13 Self-concordance and the inverse barrier.

(a) Show that $f(x) = 1/x$ with domain $(0, 8/9)$ is self-concordant.

(b) Show that the function

$$f(x) = \alpha \sum_{i=1}^m \frac{1}{b_i - a_i^T x}$$

with $\operatorname{dom} f = \{x \in \mathbf{R}^n \mid a_i^T x < b_i, \; i = 1, \ldots, m\}$, is self-concordant if $\operatorname{dom} f$ is
bounded and

$$\alpha > (9/8) \max_{i=1,\ldots,m} \; \sup_{x \in \operatorname{dom} f} (b_i - a_i^T x).$$


9.14 Composition with logarithm. Let $g : \mathbf{R} \to \mathbf{R}$ be a convex function with $\operatorname{dom} g = \mathbf{R}_{++}$,
and

$$|g'''(x)| \leq 3 \frac{g''(x)}{x}$$

for all $x$. Prove that $f(x) = -\log(-g(x)) - \log x$ is self-concordant on $\{x \mid x > 0, \; g(x) < 0\}$.
Hint. Use the inequality

$$\frac{3}{2} r p^2 + q^3 + \frac{3}{2} p^2 q + r^3 \leq 1$$

which holds for $p, q, r \in \mathbf{R}_+$ with $p^2 + q^2 + r^2 = 1$.

9.15 Prove that the following functions are self-concordant. In your proof, restrict the function
to a line, and apply the composition with logarithm rule.

(a) $f(x, y) = -\log(y^2 - x^T x)$ on $\{(x, y) \mid \|x\|_2 < y\}$.

(b) $f(x, y) = -2 \log y - \log(y^{2/p} - x^2)$, with $p \geq 1$, on $\{(x, y) \in \mathbf{R}^2 \mid |x|^p < y\}$.

(c) $f(x, y) = -\log y - \log(\log y - x)$ on $\{(x, y) \mid e^x < y\}$.


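9.16 Let $f : \mathbf{R} \to \mathbf{R}$ be a self-concordant function.

(a) Suppose $f''(x) \neq 0$. Show that the self-concordance condition (9.41) can be
expressed as

$$\left| \frac{d}{dx} \left( f''(x)^{-1/2} \right) \right| \leq 1.$$

Find the ‘extreme’ self-concordant functions of one variable, i.e., the functions $f$
and $\tilde f$ that satisfy

$$\frac{d}{dx} \left( f''(x)^{-1/2} \right) = 1, \qquad \frac{d}{dx} \left( \tilde f''(x)^{-1/2} \right) = -1,$$

respectively.

(b) Show that either $f''(x) = 0$ for all $x \in \operatorname{dom} f$, or $f''(x) > 0$ for all $x \in \operatorname{dom} f$.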

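9.17 Upper and lower bounds on the Hessian of a self-concordant function.

(a) Let $f : \mathbf{R}^2 \to \mathbf{R}$ be a self-concordant function. Show that

$$\left| \frac{\partial^3 f(x)}{\partial x_i^3} \right| \leq 2 \left( \frac{\partial^2 f(x)}{\partial x_i^2} \right)^{3/2}, \quad i = 1, 2,$$

$$\left| \frac{\partial^3 f(x)}{\partial x_i^2 \partial x_j} \right| \leq 2 \, \frac{\partial^2 f(x)}{\partial x_i^2} \left( \frac{\partial^2 f(x)}{\partial x_j^2} \right)^{1/2}, \quad i \neq j$$

for all $x \in \operatorname{dom} f$.

Hint. If $h : \mathbf{R}^2 \times \mathbf{R}^2 \times \mathbf{R}^2 \to \mathbf{R}$ is a symmetric trilinear form, i.e.,

$$h(u, v, w) = a_1 u_1 v_1 w_1 + a_2 (u_1 v_1 w_2 + u_1 v_2 w_1 + u_2 v_1 w_1) + a_3 (u_1 v_2 w_2 + u_2 v_1 w_2 + u_2 v_2 w_1) + a_4 u_2 v_2 w_2,$$

then

$$\sup_{u, v, w \neq 0} \frac{h(u, v, w)}{\|u\|_2 \|v\|_2 \|w\|_2} = \sup_{u \neq 0} \frac{h(u, u, u)}{\|u\|_2^3}.$$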

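(b) Let $f : \mathbf{R}^n \to \mathbf{R}$ be a self-concordant function. Show that the nullspace of $\nabla^2 f(x)$
is independent of $x$. Show that if $f$ is strictly convex, then $\nabla^2 f(x)$ is nonsingular
for all $x \in \operatorname{dom} f$.

Hint. Prove that if $w^T \nabla^2 f(x) w = 0$ for some $x \in \operatorname{dom} f$, then $w^T \nabla^2 f(y) w = 0$ for
all $y \in \operatorname{dom} f$. To show this, apply the result in (a) to the self-concordant function
$\tilde f(t, s) = f(x + t(y - x) + sw)$.

(c) Let $f : \mathbf{R}^n \to \mathbf{R}$ be a self-concordant function. Suppose $x \in \operatorname{dom} f$, $v \in \mathbf{R}^n$. Show
that

$$(1 - t\alpha)^2 \nabla^2 f(x) \preceq \nabla^2 f(x + tv) \preceq \frac{1}{(1 - t\alpha)^2} \nabla^2 f(x)$$

for $x + tv \in \operatorname{dom} f$, $0 \leq t < \alpha$, where $\alpha = (v^T \nabla^2 f(x) v)^{1/2}$.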


9.18 Quadratic convergence. Let $f : \mathbf{R}^n \to \mathbf{R}$ be a strictly convex self-concordant function.
Suppose $\lambda(x) < 1$, and define $x^+ = x - \nabla^2 f(x)^{-1} \nabla f(x)$. Prove that
$\lambda(x^+) \leq \lambda(x)^2 / (1 - \lambda(x))^2$. Hint. Use the inequalities in exercise 9.17, part (c).

9.19 Bound on the distance from the optimum. Let $f : \mathbf{R}^n \to \mathbf{R}$ be a strictly convex
self-concordant function.

(a) Suppose $\lambda(\bar x) < 1$ and the sublevel set $\{x \mid f(x) \leq f(\bar x)\}$ is closed. Show that the
minimum of $f$ is attained and

$$\left( (\bar x - x^\star)^T \nabla^2 f(\bar x) (\bar x - x^\star) \right)^{1/2} \leq \frac{\lambda(\bar x)}{1 - \lambda(\bar x)}.$$

(b) Show that if $f$ has a closed sublevel set, and is bounded below, then its minimum is
attained.


9.20 Conjugate of a self-concordant function. Suppose $f : \mathbf{R}^n \to \mathbf{R}$ is closed, strictly convex,
and self-concordant. We show that its conjugate (or Legendre transform) $f^*$ is
self-concordant.

(a) Show that for each $y \in \operatorname{dom} f^*$, there is a unique $x \in \operatorname{dom} f$ that satisfies
$y = \nabla f(x)$. Hint. Refer to the result of exercise 9.19.

(b) Suppose $\bar y = \nabla f(\bar x)$. Define

$$g(t) = f(\bar x + tv), \qquad h(t) = f^*(\bar y + tw)$$

where $v \in \mathbf{R}^n$ and $w = \nabla^2 f(\bar x) v$. Show that

$$g''(0) = h''(0), \qquad g'''(0) = -h'''(0).$$

Use these identities to show that $f^*$ is self-concordant.

9.21 Optimal line search parameters. Consider the upper bound (9.56) on the number of
Newton iterations required to minimize a strictly convex self-concordant function. What
is the minimum value of the upper bound, if we minimize over $\alpha$ and $\beta$?

9.22 Suppose that $f$ is strictly convex and satisfies (9.42). Give a bound on the number of
Newton steps required to compute $p^\star$ within $\epsilon$, starting at $x^{(0)}$.

Implementation

9.23 Pre-computation for line searches. For each of the following functions, explain how the
computational cost of a line search can be reduced by a pre-computation. Give the cost
of the pre-computation, and the cost of evaluating $g(t) = f(x + t\Delta x)$ and $g'(t)$ with and
without the pre-computation.

(a) $f(x) = -\sum_{i=1}^m \log(b_i - a_i^T x)$.

(b) $f(x) = \log \left( \sum_{i=1}^m \exp(a_i^T x + b_i) \right)$.

(c) $f(x) = (Ax - b)^T (P_0 + x_1 P_1 + \cdots + x_n P_n)^{-1} (Ax - b)$, where $P_i \in \mathbf{S}^m$,
$A \in \mathbf{R}^{m \times n}$, $b \in \mathbf{R}^m$ and $\operatorname{dom} f = \{x \mid P_0 + \sum_{i=1}^n x_i P_i \succ 0\}$.

9.24 Exploiting block diagonal structure in the Newton system. Suppose the Hessian $\nabla^2 f(x)$ of
a convex function $f$ is block diagonal. How do we exploit this structure when computing
the Newton step? What does it mean about $f$?

9.25 Smoothed fit to given data. Consider the problem

$$\text{minimize} \quad f(x) = \sum_{i=1}^n \psi(x_i - y_i) + \lambda \sum_{i=1}^{n-1} (x_{i+1} - x_i)^2$$

where $\lambda > 0$ is a smoothing parameter, $\psi$ is a convex penalty function, and $x \in \mathbf{R}^n$ is the
variable. We can interpret $x$ as a smoothed fit to the vector $y$.

(a) What is the structure in the Hessian of $f$?

(b) Extend to the problem of making a smooth fit to two-dimensional data, i.e.,
minimizing the function

$$\sum_{i,j=1}^n \psi(x_{ij} - y_{ij}) + \lambda \left( \sum_{i=1}^{n-1} \sum_{j=1}^n (x_{i+1,j} - x_{ij})^2 + \sum_{i=1}^n \sum_{j=1}^{n-1} (x_{i,j+1} - x_{ij})^2 \right),$$

with variable $X \in \mathbf{R}^{n \times n}$, where $Y \in \mathbf{R}^{n \times n}$ and $\lambda > 0$ are given.


9.26 Newton equations with linear structure. Consider the problem of minimizing a function
of the form

$$f(x) = \sum_{i=1}^N \psi_i(A_i x + b_i) \tag{9.63}$$

where $A_i \in \mathbf{R}^{m_i \times n}$, $b_i \in \mathbf{R}^{m_i}$, and the functions $\psi_i : \mathbf{R}^{m_i} \to \mathbf{R}$ are twice
differentiable and convex. The Hessian $H$ and gradient $g$ of $f$ at $x$ are given by

$$H = \sum_{i=1}^N A_i^T H_i A_i, \qquad g = \sum_{i=1}^N A_i^T g_i, \tag{9.64}$$

where $H_i = \nabla^2 \psi_i(A_i x + b_i)$ and $g_i = \nabla \psi_i(A_i x + b_i)$.

Describe how you would implement Newton’s method for minimizing $f$. Assume that
$n \gg m_i$, the matrices $A_i$ are very sparse, but the Hessian $H$ is dense.

9.27 Analytic center of linear inequalities with variable bounds. Give the most efficient method
for computing the Newton step of the function

$$f(x) = -\sum_{i=1}^n \log(x_i + 1) - \sum_{i=1}^n \log(1 - x_i) - \sum_{i=1}^m \log(b_i - a_i^T x),$$

with $\operatorname{dom} f = \{x \in \mathbf{R}^n \mid -\mathbf{1} \prec x \prec \mathbf{1}, \; Ax \prec b\}$, where $a_i^T$ is the $i$th row of $A$.
Assume $A$ is dense, and distinguish two cases: $m \geq n$ and $m \leq n$. (See also exercise 9.30.)

9.28 Analytic center of quadratic inequalities. Describe an efficient method for computing the
Newton step of the function

$$f(x) = -\sum_{i=1}^m \log(-x^T A_i x - b_i^T x - c_i),$$

with $\operatorname{dom} f = \{x \mid x^T A_i x + b_i^T x + c_i < 0, \; i = 1, \ldots, m\}$. Assume that the matrices
$A_i \in \mathbf{S}^n_{++}$ are large and sparse, and $m \ll n$.

Hint. The Hessian and gradient of $f$ at $x$ are given by

$$H = \sum_{i=1}^m \left( 2\alpha_i A_i + \alpha_i^2 (2A_i x + b_i)(2A_i x + b_i)^T \right), \qquad g = \sum_{i=1}^m \alpha_i (2A_i x + b_i),$$

where $\alpha_i = 1/(-x^T A_i x - b_i^T x - c_i)$.

9.29 Exploiting structure in two-stage optimization. This exercise continues exercise 4.64, which
describes optimization with recourse, or two-stage optimization. Using the notation and
assumptions in exercise 4.64, we assume in addition that the cost function $f$ is a twice
differentiable function of $(x, z)$, for each scenario $i = 1, \ldots, S$.

Explain  how  to  efficiently  compute  the  Newton  step  for  the  problem  of  finding  the  optimal 
policy.  How  does  the  approximate  flop  count  for  your  method  compare  to  that  of  a  generic 
method  (which  exploits  no  structure),  as  a  function  of  S,  the  number  of  scenarios? 

Numerical  experiments 

9.30 Gradient and Newton methods. Consider the unconstrained problem

$$\text{minimize} \quad f(x) = -\sum_{i=1}^m \log(1 - a_i^T x) - \sum_{i=1}^n \log(1 - x_i^2),$$

with variable $x \in \mathbf{R}^n$, and $\operatorname{dom} f = \{x \mid a_i^T x < 1, \; i = 1, \ldots, m, \; |x_i| < 1, \; i = 1, \ldots, n\}$.
This is the problem of computing the analytic center of the set of linear inequalities

$$a_i^T x \leq 1, \quad i = 1, \ldots, m, \qquad |x_i| \leq 1, \quad i = 1, \ldots, n.$$

Note that we can choose $x^{(0)} = 0$ as our initial point. You can generate instances of this
problem by choosing $a_i$ from some distribution on $\mathbf{R}^n$.

(a) Use the gradient method to solve the problem, using reasonable choices for the
backtracking parameters, and a stopping criterion of the form $\|\nabla f(x)\|_2 \leq \eta$. Plot the
objective function and step length versus iteration number. (Once you have determined
$p^\star$ to high accuracy, you can also plot $f - p^\star$ versus iteration.) Experiment
with the backtracking parameters $\alpha$ and $\beta$ to see their effect on the total number of
iterations required. Carry these experiments out for several instances of the problem,
of different sizes.

(b) Repeat using Newton’s method, with stopping criterion based on the Newton
decrement $\lambda^2$. Look for quadratic convergence. You do not have to use an efficient method
to compute the Newton step, as in exercise 9.27; you can use a general purpose dense
solver, although it is better to use one that is based on a Cholesky factorization.

Hint. Use the chain rule to find expressions for $\nabla f(x)$ and $\nabla^2 f(x)$.
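
As a starting point for part (b), here is one possible sketch of Newton’s method with backtracking line search for this problem. Everything beyond the problem statement is an assumption made for illustration: the function names, the Gaussian distribution for the rows $a_i^T$, the parameter values $\alpha = 0.1$, $\beta = 0.5$, and the tolerance on $\lambda^2/2$. A general purpose dense solver is used for the Newton system.

```python
import numpy as np

def f(A, x):
    return -np.sum(np.log(1 - A @ x)) - np.sum(np.log(1 - x**2))

def in_domain(A, x):
    return np.all(A @ x < 1) and np.all(np.abs(x) < 1)

def grad_hess(A, x):
    s = 1 - A @ x                            # slacks of a_i^T x < 1
    g = A.T @ (1 / s) + 2 * x / (1 - x**2)
    H = (A.T / s**2) @ A + np.diag(2 * (1 + x**2) / (1 - x**2)**2)
    return g, H

def newton(A, x0, tol=1e-10, alpha=0.1, beta=0.5, max_iter=50):
    x = x0.copy()
    for _ in range(max_iter):
        g, H = grad_hess(A, x)
        dx = -np.linalg.solve(H, g)          # Newton step
        lam2 = -g @ dx                       # squared Newton decrement
        if lam2 / 2 <= tol:
            break
        t = 1.0
        while not in_domain(A, x + t * dx):  # first move back inside dom f
            t *= beta
        while f(A, x + t * dx) > f(A, x) + alpha * t * (g @ dx):
            t *= beta                        # backtracking condition
        x = x + t * dx
    return x

rng = np.random.default_rng(0)
m, n = 30, 10
A = rng.standard_normal((m, n))              # rows a_i^T drawn from N(0, I)
x_center = newton(A, np.zeros(n))
```

Recording `lam2` and `t` at each iteration inside the loop gives the data needed for the convergence plots.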

9.31 Some approximate Newton methods. The cost of Newton’s method is dominated by the
cost of evaluating the Hessian $\nabla^2 f(x)$ and the cost of solving the Newton system. For large
problems, it is sometimes useful to replace the Hessian by a positive definite approximation
that makes it easier to form and solve for the search step. In this problem we explore
some common examples of this idea.

For each of the approximate Newton methods described below, test the method on some
instances of the analytic centering problem described in exercise 9.30, and compare the
results to those obtained using the Newton method and gradient method.

(a) Re-using the Hessian. We evaluate and factor the Hessian only every $N$ iterations,
where $N > 1$, and use the search step $\Delta x = -H^{-1} \nabla f(x)$, where $H$ is the last Hessian
evaluated. (We need to evaluate and factor the Hessian once every $N$ steps; for the
other steps, we compute the search direction using back and forward substitution.)

(b) Diagonal approximation. We replace the Hessian by its diagonal, so we only have
to evaluate the $n$ second derivatives $\partial^2 f(x) / \partial x_i^2$, and computing the search step is
very easy.

9.32 Gauss-Newton method for convex nonlinear least-squares problems. We consider a
(nonlinear) least-squares problem, in which we minimize a function of the form

$$f(x) = \frac{1}{2} \sum_{i=1}^m f_i(x)^2,$$

where $f_i$ are twice differentiable functions. The gradient and Hessian of $f$ at $x$ are given
by

$$\nabla f(x) = \sum_{i=1}^m f_i(x) \nabla f_i(x), \qquad \nabla^2 f(x) = \sum_{i=1}^m \left( \nabla f_i(x) \nabla f_i(x)^T + f_i(x) \nabla^2 f_i(x) \right).$$

We consider the case when $f$ is convex. This occurs, for example, if each $f_i$ is either
nonnegative and convex, or nonpositive and concave, or affine.

The Gauss-Newton method uses the search direction

$$\Delta x_{\mathrm{gn}} = -\left( \sum_{i=1}^m \nabla f_i(x) \nabla f_i(x)^T \right)^{-1} \left( \sum_{i=1}^m f_i(x) \nabla f_i(x) \right).$$

(We assume here that the inverse exists, i.e., the vectors $\nabla f_1(x), \ldots, \nabla f_m(x)$ span $\mathbf{R}^n$.)
This search direction can be considered an approximate Newton direction (see
exercise 9.31), obtained by dropping the second derivative terms from the Hessian of $f$.

We can give another simple interpretation of the Gauss-Newton search direction $\Delta x_{\mathrm{gn}}$.
Using the first-order approximation $f_i(x + v) \approx f_i(x) + \nabla f_i(x)^T v$ we obtain the
approximation

$$f(x + v) \approx \frac{1}{2} \sum_{i=1}^m \left( f_i(x) + \nabla f_i(x)^T v \right)^2.$$

The Gauss-Newton search step $\Delta x_{\mathrm{gn}}$ is precisely the value of $v$ that minimizes this
approximation of $f$. (Moreover, we conclude that $\Delta x_{\mathrm{gn}}$ can be computed by solving a linear
least-squares problem.)
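
The least-squares interpretation can be checked numerically. In the sketch below, which is only illustrative, the residual vector `r` holds the values $f_i(x)$ and the rows of `J` hold the gradients $\nabla f_i(x)^T$; these names, the helper `gauss_newton_step`, and the random test functions (quadratics $f_i$ with positive definite $A_i$) are our own assumptions.

```python
import numpy as np

def gauss_newton_step(r, J):
    """Return the v minimizing ||r + J v||_2, where r[i] = f_i(x), J[i] = grad f_i(x)^T."""
    v, *_ = np.linalg.lstsq(J, -r, rcond=None)
    return v

rng = np.random.default_rng(0)
m, n = 8, 3
x = rng.standard_normal(n)
r = np.empty(m)
J = np.empty((m, n))
for i in range(m):
    C = rng.standard_normal((n, n))
    Ai = C @ C.T + np.eye(n)                 # a positive definite A_i
    bi = rng.standard_normal(n)
    r[i] = 0.5 * x @ Ai @ x + bi @ x + 1     # f_i(x)
    J[i] = Ai @ x + bi                       # grad f_i(x)
dx_gn = gauss_newton_step(r, J)
dx_formula = -np.linalg.solve(J.T @ J, J.T @ r)   # closed-form expression
```

The two results agree because $\sum_i \nabla f_i(x) \nabla f_i(x)^T = J^T J$ and $\sum_i f_i(x) \nabla f_i(x) = J^T r$, so the closed-form expression is the normal-equations solution of the same least-squares problem.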

Test the Gauss-Newton method on some problem instances of the form

$$f_i(x) = (1/2) x^T A_i x + b_i^T x + 1,$$

with $A_i \in \mathbf{S}^n_{++}$ and $b_i^T A_i^{-1} b_i \leq 2$ (which ensures that $f$ is convex).


Chapter  10 

Equality  constrained 
minimization 


10.1  Equality  constrained  minimization  problems 

In this chapter we describe methods for solving a convex optimization problem
with equality constraints,

$$\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & Ax = b, \end{array} \tag{10.1}$$

where $f : \mathbf{R}^n \to \mathbf{R}$ is convex and twice continuously differentiable, and
$A \in \mathbf{R}^{p \times n}$ with $\operatorname{rank} A = p < n$. The assumptions on $A$ mean that there are
fewer equality constraints than variables, and that the equality constraints are
independent. We will assume that an optimal solution $x^\star$ exists, and use $p^\star$ to
denote the optimal value, $p^\star = \inf \{f(x) \mid Ax = b\} = f(x^\star)$.

Recall (from §4.2.3 or §5.5.3) that a point $x^\star \in \operatorname{dom} f$ is optimal for (10.1) if
and only if there is a $\nu^\star \in \mathbf{R}^p$ such that

$$Ax^\star = b, \qquad \nabla f(x^\star) + A^T \nu^\star = 0. \tag{10.2}$$

Solving the equality constrained optimization problem (10.1) is therefore equivalent
to finding a solution of the KKT equations (10.2), which is a set of $n + p$ equations
in the $n + p$ variables $x^\star$, $\nu^\star$. The first set of equations, $Ax^\star = b$, are called
the primal feasibility equations, which are linear. The second set of equations,
$\nabla f(x^\star) + A^T \nu^\star = 0$, are called the dual feasibility equations, and are in general
nonlinear. As with unconstrained optimization, there are a few problems for which
we can solve these optimality conditions analytically. The most important special
case is when $f$ is quadratic, which we examine in §10.1.1.

Any equality constrained minimization problem can be reduced to an equivalent
unconstrained problem by eliminating the equality constraints, after which
the methods of chapter 9 can be used to solve the problem. Another approach
is to solve the dual problem (assuming the dual function is twice differentiable)
using an unconstrained minimization method, and then recover the solution of the
equality constrained problem (10.1) from the dual solution. The elimination and
dual methods are briefly discussed in §10.1.2 and §10.1.3, respectively.

The bulk of this chapter is devoted to extensions of Newton’s method that
directly handle equality constraints. In many cases these methods are preferable to
methods that reduce an equality constrained problem to an unconstrained one. One
reason is that problem structure, such as sparsity, is often destroyed by elimination
(or forming the dual); in contrast, a method that directly handles equality
constraints can exploit the problem structure. Another reason is conceptual: methods
that directly handle equality constraints can be thought of as methods for directly
solving the optimality conditions (10.2).


10.1.1  Equality  constrained  convex  quadratic  minimization 

Consider the equality constrained convex quadratic minimization problem

$$\begin{array}{ll} \text{minimize} & f(x) = (1/2) x^T P x + q^T x + r \\ \text{subject to} & Ax = b, \end{array} \tag{10.3}$$

where $P \in \mathbf{S}^n_+$ and $A \in \mathbf{R}^{p \times n}$. This problem is important on its own, and also
because it forms the basis for an extension of Newton’s method to equality
constrained problems.

Here the optimality conditions (10.2) are

$$Ax^\star = b, \qquad Px^\star + q + A^T \nu^\star = 0,$$

which we can write as

$$\begin{bmatrix} P & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x^\star \\ \nu^\star \end{bmatrix} = \begin{bmatrix} -q \\ b \end{bmatrix}. \tag{10.4}$$

This set of $n + p$ linear equations in the $n + p$ variables $x^\star$, $\nu^\star$ is called the KKT
system for the equality constrained quadratic optimization problem (10.3). The
coefficient matrix is called the KKT matrix.
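
As a small numerical illustration, the KKT system (10.4) can be assembled and solved directly. The sketch below assumes dense NumPy arrays and a nonsingular KKT matrix; the function name and the random problem data are hypothetical, and no structure is exploited.

```python
import numpy as np

def solve_kkt_qp(P, q, A, b):
    """Solve [[P, A^T], [A, 0]] [x; nu] = [-q; b] for a primal-dual pair."""
    n, p = P.shape[0], A.shape[0]
    K = np.block([[P, A.T], [A, np.zeros((p, p))]])   # KKT matrix
    sol = np.linalg.solve(K, np.concatenate([-q, b]))
    return sol[:n], sol[n:]

rng = np.random.default_rng(0)
n, p = 5, 2
C = rng.standard_normal((n, n))
P = C @ C.T                                  # positive semidefinite P
q = rng.standard_normal(n)
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)
x_opt, nu_opt = solve_kkt_qp(P, q, A, b)
```

The returned pair satisfies both blocks of (10.4): the residuals `A @ x_opt - b` and `P @ x_opt + q + A.T @ nu_opt` are zero up to rounding.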

When the KKT matrix is nonsingular, there is a unique optimal primal-dual
pair $(x^\star, \nu^\star)$. If the KKT matrix is singular, but the KKT system is solvable, any
solution yields an optimal pair $(x^\star, \nu^\star)$. If the KKT system is not solvable, the
quadratic optimization problem is unbounded below or infeasible. Indeed, in this
case there exist $v \in \mathbf{R}^n$ and $w \in \mathbf{R}^p$ such that

$$Pv + A^T w = 0, \qquad Av = 0, \qquad -q^T v + b^T w > 0.$$

Let $\hat x$ be any feasible point. The point $x = \hat x + tv$ is feasible for all $t$ and

$$\begin{aligned} f(\hat x + tv) &= f(\hat x) + t(v^T P \hat x + q^T v) + (1/2) t^2 v^T P v \\ &= f(\hat x) + t(-\hat x^T A^T w + q^T v) - (1/2) t^2 w^T A v \\ &= f(\hat x) + t(-b^T w + q^T v), \end{aligned}$$

which decreases without bound as $t \to \infty$.

Nonsingularity of the KKT matrix

Recall our assumption that $P \in \mathbf{S}^n_+$ and $\operatorname{rank} A = p < n$. There are several
conditions equivalent to nonsingularity of the KKT matrix:

• $\mathcal{N}(P) \cap \mathcal{N}(A) = \{0\}$, i.e., $P$ and $A$ have no nontrivial common nullspace.

• $Ax = 0$, $x \neq 0 \implies x^T P x > 0$, i.e., $P$ is positive definite on the nullspace of
$A$.

• $F^T P F \succ 0$, where $F \in \mathbf{R}^{n \times (n-p)}$ is a matrix for which $\mathcal{R}(F) = \mathcal{N}(A)$.

(See exercise 10.1.) As an important special case, we note that if $P \succ 0$, the KKT
matrix must be nonsingular.


10.1.2  Eliminating  equality  constraints 


One general approach to solving the equality constrained problem (10.1) is to
eliminate the equality constraints, as described in §4.2.4, and then solve the resulting
unconstrained problem using methods for unconstrained minimization. We first
find a matrix $F \in \mathbf{R}^{n \times (n-p)}$ and vector $\hat x \in \mathbf{R}^n$ that parametrize the (affine)
feasible set:

$$\{x \mid Ax = b\} = \{Fz + \hat x \mid z \in \mathbf{R}^{n-p}\}.$$

Here $\hat x$ can be chosen as any particular solution of $Ax = b$, and $F \in \mathbf{R}^{n \times (n-p)}$
is any matrix whose range is the nullspace of $A$. We then form the reduced or
eliminated optimization problem

$$\text{minimize} \quad \tilde f(z) = f(Fz + \hat x), \tag{10.5}$$

which is an unconstrained problem with variable $z \in \mathbf{R}^{n-p}$. From its solution $z^\star$,
we can find the solution of the equality constrained problem as $x^\star = Fz^\star + \hat x$.

We can also construct an optimal dual variable $\nu^\star$ for the equality constrained
problem, as

$$\nu^\star = -(AA^T)^{-1} A \nabla f(x^\star).$$

To show that this expression is correct, we must verify that the dual feasibility
condition

$$\nabla f(x^\star) + A^T \left( -(AA^T)^{-1} A \nabla f(x^\star) \right) = 0 \tag{10.6}$$

holds. To show this, we note that

$$\begin{bmatrix} F^T \\ A \end{bmatrix} \left( \nabla f(x^\star) - A^T (AA^T)^{-1} A \nabla f(x^\star) \right) = 0,$$

where in the top block we use $F^T \nabla f(x^\star) = \nabla \tilde f(z^\star) = 0$ and $AF = 0$. Since the
matrix on the left is nonsingular, this implies (10.6).
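
The following sketch illustrates the elimination method and the recovery of $\nu^\star$ on a small example. It assumes a strictly convex quadratic objective (so that the reduced problem can be solved with one linear system), takes $F$ from the trailing right singular vectors of $A$, and takes $\hat x$ as the least-norm solution of $Ax = b$; these are illustrative choices, and any valid $F$ and $\hat x$ would do.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 2
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)
C = rng.standard_normal((n, n))
P = C @ C.T + np.eye(n)
q = rng.standard_normal(n)
grad_f = lambda x: P @ x + q                 # f(x) = (1/2) x^T P x + q^T x

# Columns of F span the nullspace of A (trailing right singular vectors).
_, _, Vt = np.linalg.svd(A)
F = Vt[p:].T
x_hat = np.linalg.lstsq(A, b, rcond=None)[0] # a particular solution of A x = b

# Reduced problem: minimize f(F z + x_hat). For quadratic f the optimality
# condition F^T grad_f(F z + x_hat) = 0 is a linear system in z.
z_star = np.linalg.solve(F.T @ P @ F, -F.T @ grad_f(x_hat))
x_star = F @ z_star + x_hat

# Optimal dual variable recovered from the primal solution.
nu_star = -np.linalg.solve(A @ A.T, A @ grad_f(x_star))
```

The pair `x_star`, `nu_star` then satisfies the optimality conditions (10.2) up to rounding, which is a convenient check on the formula for $\nu^\star$.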


Example 10.1 Optimal allocation with resource constraint. We consider the problem

$$\begin{array}{ll} \text{minimize} & \sum_{i=1}^n f_i(x_i) \\ \text{subject to} & \sum_{i=1}^n x_i = b, \end{array}$$

where the functions $f_i : \mathbf{R} \to \mathbf{R}$ are convex and twice differentiable, and $b \in \mathbf{R}$ is
a problem parameter. We interpret this as the problem of optimally allocating a
single resource, with a fixed total amount $b$ (the budget) to $n$ otherwise independent
activities.

We can eliminate $x_n$ (for example) using the parametrization

$$x_n = b - x_1 - \cdots - x_{n-1},$$

which corresponds to the choices

$$\hat x = b e_n, \qquad F = \begin{bmatrix} I \\ -\mathbf{1}^T \end{bmatrix} \in \mathbf{R}^{n \times (n-1)}.$$

The reduced problem is then

$$\text{minimize} \quad f_n(b - x_1 - \cdots - x_{n-1}) + \sum_{i=1}^{n-1} f_i(x_i),$$

with variables $x_1, \ldots, x_{n-1}$.


Choice  of  elimination  matrix 


There are, of course, many possible choices for the elimination matrix $F$, which can
be chosen as any matrix in $\mathbf{R}^{n \times (n-p)}$ with $\mathcal{R}(F) = \mathcal{N}(A)$. If $F$ is one such matrix,
and $T \in \mathbf{R}^{(n-p) \times (n-p)}$ is nonsingular, then $\tilde F = FT$ is also a suitable elimination
matrix, since

$$\mathcal{R}(\tilde F) = \mathcal{R}(F) = \mathcal{N}(A).$$

Conversely, if $F$ and $\tilde F$ are any two suitable elimination matrices, then there is
some nonsingular $T$ such that $\tilde F = FT$.

If we eliminate the equality constraints using $F$, we solve the unconstrained
problem

$$\text{minimize} \quad f(Fz + \hat x),$$

while if $\tilde F$ is used, we solve the unconstrained problem

$$\text{minimize} \quad f(\tilde F \tilde z + \hat x) = f(F(T \tilde z) + \hat x).$$

This problem is equivalent to the one above, and is simply obtained by the change
of coordinates $z = T \tilde z$. In other words, changing the elimination matrix can be
thought of as changing variables in the reduced problem.


10.1.3  Solving  equality  constrained  problems  via  the  dual 

Another approach to solving (10.1) is to solve the dual, and then recover the optimal
primal variable $x^\star$, as described in §5.5.5. The dual function of (10.1) is

$$\begin{aligned} g(\nu) &= -b^T \nu + \inf_x \left( f(x) + \nu^T A x \right) \\ &= -b^T \nu - \sup_x \left( (-A^T \nu)^T x - f(x) \right) \\ &= -b^T \nu - f^*(-A^T \nu), \end{aligned}$$

where $f^*$ is the conjugate of $f$, so the dual problem is

$$\text{maximize} \quad -b^T \nu - f^*(-A^T \nu).$$

Since by assumption there is an optimal point, the problem is strictly feasible, so
Slater’s condition holds. Therefore strong duality holds, and the dual optimum is
attained, i.e., there exists a $\nu^\star$ with $g(\nu^\star) = p^\star$.

If the dual function $g$ is twice differentiable, then the methods for unconstrained
minimization described in chapter 9 can be used to maximize $g$. (In general, the
dual function $g$ need not be twice differentiable, even if $f$ is.) Once we find an
optimal dual variable $\nu^\star$, we reconstruct an optimal primal solution $x^\star$ from it.
(This is not always straightforward; see §5.5.5.)


Example 10.2 Equality constrained analytic center. We consider the problem

$$\begin{array}{ll} \text{minimize} & f(x) = -\sum_{i=1}^n \log x_i \\ \text{subject to} & Ax = b, \end{array} \tag{10.7}$$

where $A \in \mathbf{R}^{p \times n}$, with implicit constraint $x \succ 0$. Using

$$f^*(y) = \sum_{i=1}^n \left( -1 - \log(-y_i) \right) = -n - \sum_{i=1}^n \log(-y_i)$$

(with $\operatorname{dom} f^* = -\mathbf{R}^n_{++}$), the dual problem is

$$\text{maximize} \quad g(\nu) = -b^T \nu + n + \sum_{i=1}^n \log(A^T \nu)_i, \tag{10.8}$$

with implicit constraint $A^T \nu \succ 0$. Here we can easily solve the dual feasibility
equation, i.e., find the $x$ that minimizes $L(x, \nu)$:

$$\nabla f(x) + A^T \nu = -(1/x_1, \ldots, 1/x_n) + A^T \nu = 0,$$

and so

$$x_i(\nu) = 1/(A^T \nu)_i. \tag{10.9}$$

To solve the equality constrained analytic centering problem (10.7), we solve the
(unconstrained) dual problem (10.8), and then recover the optimal solution of (10.7)
via (10.9).
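
The dual approach of this example can be sketched as follows. The code applies Newton’s method with backtracking to $h(\nu) = b^T \nu - \sum_i \log(A^T \nu)_i$, which is $-g(\nu)$ with the constant $n$ dropped, and then recovers $x$ from (10.9). The function name, the line search parameters, and the construction of the test instance (a positive first row of $A$, so that $\nu = e_1$ is a dual feasible starting point, and $b = A x$ for some $x \succ 0$) are assumptions made only for illustration.

```python
import numpy as np

def dual_analytic_center(A, b, nu0, tol=1e-12, alpha=0.1, beta=0.5, max_iter=100):
    """Minimize h(nu) = b^T nu - sum(log(A^T nu)) by Newton's method."""
    h = lambda nu: b @ nu - np.sum(np.log(A.T @ nu))
    nu = nu0.astype(float)
    for _ in range(max_iter):
        s = A.T @ nu                         # must remain positive
        grad = b - A @ (1.0 / s)
        hess = (A / s**2) @ A.T              # A diag(1/s^2) A^T
        dnu = -np.linalg.solve(hess, grad)
        lam2 = -grad @ dnu                   # squared Newton decrement
        if lam2 / 2 <= tol:
            break
        t = 1.0
        while np.min(A.T @ (nu + t * dnu)) <= 0:   # stay in the dual domain
            t *= beta
        while h(nu + t * dnu) > h(nu) + alpha * t * (grad @ dnu):
            t *= beta
        nu = nu + t * dnu
    return nu

rng = np.random.default_rng(0)
p, n = 3, 8
A = rng.standard_normal((p, n))
A[0] = rng.uniform(0.5, 1.5, n)              # positive first row of A
b = A @ rng.uniform(0.5, 1.5, n)             # guarantees a primal feasible point
nu0 = np.zeros(p)
nu0[0] = 1.0                                 # A^T nu0 = first row of A > 0
nu = dual_analytic_center(A, b, nu0)
x = 1.0 / (A.T @ nu)                         # recover the primal point via (10.9)
```

At a dual optimum the gradient $b - A x(\nu)$ vanishes, so the recovered point is primal feasible; in this sketch the residual `A @ x - b` is small but limited by the stopping tolerance.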


10.2  Newton’s  method  with  equality  constraints 

In this section we describe an extension of Newton’s method to include equality
constraints. The method is almost the same as Newton’s method without
constraints, except for two differences: The initial point must be feasible (i.e., satisfy
$x \in \operatorname{dom} f$ and $Ax = b$), and the definition of Newton step is modified to take
the equality constraints into account. In particular, we make sure that the Newton
step $\Delta x_{\mathrm{nt}}$ is a feasible direction, i.e., $A \Delta x_{\mathrm{nt}} = 0$.

10.2.1 The Newton step


Definition  via  second-order  approximation 

To derive the Newton step $\Delta x_{\mathrm{nt}}$ for the equality constrained problem

$$\begin{array}{ll} \text{minimize} & f(x) \\ \text{subject to} & Ax = b, \end{array}$$

at the feasible point $x$, we replace the objective with its second-order Taylor
approximation near $x$, to form the problem

$$\begin{array}{ll} \text{minimize} & \hat f(x + v) = f(x) + \nabla f(x)^T v + (1/2) v^T \nabla^2 f(x) v \\ \text{subject to} & A(x + v) = b, \end{array} \tag{10.10}$$

with variable $v$. This is a (convex) quadratic minimization problem with equality
constraints, and can be solved analytically. We define $\Delta x_{\mathrm{nt}}$, the Newton step at $x$,
as the solution of the convex quadratic problem (10.10), assuming the associated
KKT matrix is nonsingular. In other words, the Newton step $\Delta x_{\mathrm{nt}}$ is what must
be added to $x$ to solve the problem when the quadratic approximation is used in
place of $f$.

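From our analysis in §10.1.1 of the equality constrained quadratic problem, the
Newton step $\Delta x_{\rm nt}$ is characterized by

$$
\begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x_{\rm nt} \\ w \end{bmatrix}
=
\begin{bmatrix} -\nabla f(x) \\ 0 \end{bmatrix}, \tag{10.11}
$$

where $w$ is the associated optimal dual variable for the quadratic problem. The
Newton step is defined only at points for which the KKT matrix is nonsingular.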

As in Newton's method for unconstrained problems, we observe that when the
objective $f$ is exactly quadratic, the Newton update $x + \Delta x_{\rm nt}$ exactly solves the
equality constrained minimization problem, and in this case the vector $w$ is the
optimal dual variable for the original problem. This suggests, as in the unconstrained
case, that when $f$ is nearly quadratic, $x + \Delta x_{\rm nt}$ should be a very good estimate of
the solution $x^\star$, and $w$ should be a good estimate of the optimal dual variable $\nu^\star$.


Solution  of  linearized  optimality  conditions 

We can interpret the Newton step $\Delta x_{\rm nt}$, and the associated vector $w$, as the
solutions of a linearized approximation of the optimality conditions

$$
Ax^\star = b, \qquad \nabla f(x^\star) + A^T \nu^\star = 0.
$$

We substitute $x + \Delta x_{\rm nt}$ for $x^\star$ and $w$ for $\nu^\star$, and replace the gradient term in the
second equation by its linearized approximation near $x$, to obtain the equations

$$
A(x + \Delta x_{\rm nt}) = b, \qquad
\nabla f(x + \Delta x_{\rm nt}) + A^T w \approx \nabla f(x) + \nabla^2 f(x) \Delta x_{\rm nt} + A^T w = 0.
$$

Using $Ax = b$, these become

$$
A \Delta x_{\rm nt} = 0, \qquad \nabla^2 f(x) \Delta x_{\rm nt} + A^T w = -\nabla f(x),
$$

which are precisely the equations (10.11) that define the Newton step.
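As a small illustration of (10.11), the Newton step can be computed by forming and solving the KKT system directly. This is only a sketch: the function name `newton_step` and the random quadratic test problem are assumptions introduced here, and a dense solve is used for simplicity rather than any structured factorization.

```python
import numpy as np

def newton_step(hess, grad, A):
    """Solve the KKT system (10.11) for the Newton step dx_nt and dual vector w.

    Hypothetical helper; assumes the KKT matrix is nonsingular.
    """
    n, p = hess.shape[0], A.shape[0]
    kkt = np.block([[hess, A.T], [A, np.zeros((p, p))]])
    rhs = np.concatenate([-grad, np.zeros(p)])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:n], sol[n:]

# Test problem: f(x) = (1/2) x^T P x + q^T x, which is exactly quadratic.
rng = np.random.default_rng(0)
n, p = 6, 2
G = rng.standard_normal((n, n))
P = G @ G.T + np.eye(n)                  # positive definite Hessian
q = rng.standard_normal(n)
A = rng.standard_normal((p, n))
x = rng.standard_normal(n)
b = A @ x                                # define b so that x is feasible

dx, w = newton_step(P, P @ x + q, A)
x_new = x + dx                           # solves the quadratic problem in one step
```

Because the objective here is exactly quadratic, `x_new` and `w` satisfy the optimality conditions of the original problem, in line with the remark above.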


The Newton decrement

We define the Newton decrement for the equality constrained problem as

$$
\lambda(x) = \left( \Delta x_{\rm nt}^T \nabla^2 f(x) \Delta x_{\rm nt} \right)^{1/2}. \tag{10.12}
$$

This is exactly the same expression as (9.29), used in the unconstrained case, and
the same interpretations hold. For example, $\lambda(x)$ is the norm of the Newton step,
in the norm determined by the Hessian.

Let

$$
\hat f(x + v) = f(x) + \nabla f(x)^T v + (1/2) v^T \nabla^2 f(x) v
$$

be the second-order Taylor approximation of $f$ at $x$. The difference between $f(x)$
and the minimum of the second-order model satisfies

$$
f(x) - \inf \{ \hat f(x + v) \mid A(x + v) = b \} = \lambda(x)^2 / 2, \tag{10.13}
$$

exactly as in the unconstrained case (see exercise 10.6). This means that, as in the
unconstrained case, $\lambda(x)^2/2$ gives an estimate of $f(x) - p^\star$, based on the quadratic
model at $x$, and also that $\lambda(x)$ (or a multiple of $\lambda(x)^2$) serves as the basis of a good
stopping criterion.

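The Newton decrement comes up in the line search as well, since the directional
derivative of $f$ in the direction $\Delta x_{\rm nt}$ is

$$
\left. \frac{d}{dt} f(x + t \Delta x_{\rm nt}) \right|_{t=0} = \nabla f(x)^T \Delta x_{\rm nt} = -\lambda(x)^2, \tag{10.14}
$$

as in the unconstrained case.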
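For completeness, here is one way to see why the directional derivative equals $-\lambda(x)^2$; this short derivation is added for convenience and uses only the defining equations (10.11) and the symmetry of the Hessian. Substituting $\nabla f(x) = -(\nabla^2 f(x) \Delta x_{\rm nt} + A^T w)$ and then $A \Delta x_{\rm nt} = 0$ gives:

```latex
\begin{aligned}
\nabla f(x)^T \Delta x_{\rm nt}
  &= -\left( \nabla^2 f(x) \Delta x_{\rm nt} + A^T w \right)^T \Delta x_{\rm nt} \\
  &= -\Delta x_{\rm nt}^T \nabla^2 f(x) \Delta x_{\rm nt} - w^T \left( A \Delta x_{\rm nt} \right) \\
  &= -\lambda(x)^2 .
\end{aligned}
```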

Feasible  descent  direction 

Suppose that $Ax = b$. We say that $v \in \mathbf{R}^n$ is a feasible direction if $Av = 0$. In this
case, every point of the form $x + tv$ is also feasible, i.e., $A(x + tv) = b$. We say that
$v$ is a descent direction for $f$ at $x$, if for small $t > 0$, $f(x + tv) < f(x)$.

The Newton step is always a feasible descent direction (except when $x$ is optimal,
in which case $\Delta x_{\rm nt} = 0$). Indeed, the second set of equations that define $\Delta x_{\rm nt}$
are $A \Delta x_{\rm nt} = 0$, which shows it is a feasible direction; that it is a descent direction
follows from (10.14).

Affine  invariance 

Like the Newton step and decrement for unconstrained optimization, the Newton
step and decrement for equality constrained optimization are affine invariant.
Suppose $T \in \mathbf{R}^{n \times n}$ is nonsingular, and define $\bar f(y) = f(Ty)$. We have

$$
\nabla \bar f(y) = T^T \nabla f(Ty), \qquad \nabla^2 \bar f(y) = T^T \nabla^2 f(Ty) T,
$$

and the equality constraint $Ax = b$ becomes $ATy = b$.

Now consider the problem of minimizing $\bar f(y)$, subject to $ATy = b$. The Newton
step $\Delta y_{\rm nt}$ at $y$ is given by the solution of

$$
\begin{bmatrix} T^T \nabla^2 f(Ty) T & T^T A^T \\ AT & 0 \end{bmatrix}
\begin{bmatrix} \Delta y_{\rm nt} \\ \bar w \end{bmatrix}
=
\begin{bmatrix} -T^T \nabla f(Ty) \\ 0 \end{bmatrix}.
$$

Comparing with the Newton step $\Delta x_{\rm nt}$ for $f$ at $x = Ty$, given in (10.11), we see
that

$$
T \Delta y_{\rm nt} = \Delta x_{\rm nt}
$$

(and $w = \bar w$), i.e., the Newton steps at $y$ and $x$ are related by the same change of
coordinates as $Ty = x$.


10.2.2  Newton’s  method  with  equality  constraints 

The  outline  of  Newton’s  method  with  equality  constraints  is  exactly  the  same  as 
for  unconstrained  problems. 


Algorithm 10.1 Newton's method for equality constrained minimization.

given starting point $x \in \mathbf{dom}\, f$ with $Ax = b$, tolerance $\epsilon > 0$.

repeat

1. Compute the Newton step and decrement $\Delta x_{\rm nt}$, $\lambda(x)$.

2. Stopping criterion. quit if $\lambda^2/2 \leq \epsilon$.

3. Line search. Choose step size $t$ by backtracking line search.

4. Update. $x := x + t \Delta x_{\rm nt}$.


The method is called a feasible descent method, since all the iterates are feasible,
with $f(x^{(k+1)}) < f(x^{(k)})$ (unless $x^{(k)}$ is optimal). Newton's method requires
that the KKT matrix be invertible at each $x$; we will be more precise about the
assumptions required for convergence in §10.2.4.
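Algorithm 10.1 can be sketched in a few lines of NumPy. The code below is an illustrative sketch under several assumptions made here: the function name `newton_eq`, the default backtracking parameters, the convention that `f` returns `inf` outside its domain, and the analytic centering test problem (with a positive first row in $A$ so that the feasible set is bounded).

```python
import numpy as np

def newton_eq(f, grad, hess, A, x0, eps=1e-12, alpha=0.25, beta=0.5, max_iter=50):
    """Feasible-start Newton method, following the outline of Algorithm 10.1.

    Hypothetical helper: x0 must satisfy A x0 = b and lie in dom f;
    f must return np.inf outside dom f so the line search rejects such points.
    """
    x = x0.astype(float).copy()
    p = A.shape[0]
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        # 1. Newton step from the KKT system (10.11), and squared decrement.
        kkt = np.block([[H, A.T], [A, np.zeros((p, p))]])
        dx = np.linalg.solve(kkt, np.concatenate([-g, np.zeros(p)]))[: x.size]
        lam2 = dx @ H @ dx
        # 2. Stopping criterion.
        if lam2 / 2 <= eps:
            break
        # 3. Backtracking line search on f; the slope along dx is -lam2 by (10.14).
        t = 1.0
        while f(x + t * dx) > f(x) - alpha * t * lam2:
            t *= beta
        # 4. Update.
        x = x + t * dx
    return x

# Analytic centering: f(x) = -sum(log x_i), dom f = positive orthant.
f = lambda x: -np.sum(np.log(x)) if np.all(x > 0) else np.inf
grad = lambda x: -1.0 / x
hess = lambda x: np.diag(1.0 / x**2)

rng = np.random.default_rng(0)
p, n = 3, 8
A = rng.standard_normal((p, n))
A[0] = rng.uniform(0.5, 1.5, n)          # positive row keeps the feasible set bounded
x0 = rng.uniform(0.5, 1.5, n)            # feasible start by construction
b = A @ x0

x_star = newton_eq(f, grad, hess, A, x0)
nu_star = np.linalg.lstsq(A.T, 1.0 / x_star, rcond=None)[0]   # dual variable estimate
```

Since every step satisfies $A \Delta x_{\rm nt} = 0$, the iterates stay feasible up to rounding error, and at the returned point the gradient $-1/x$ is (approximately) in the range of $A^T$.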


10.2.3  Newton’s  method  and  elimination 

We now show that the iterates in Newton's method for the equality constrained
problem (10.1) coincide with the iterates in Newton's method applied to the
reduced problem (10.5). Suppose $F$ satisfies $\mathcal{R}(F) = \mathcal{N}(A)$ and $\mathbf{rank}\, F = n - p$,
and $\hat x$ satisfies $A \hat x = b$. The gradient and Hessian of the reduced objective function
$\tilde f(z) = f(Fz + \hat x)$ are

$$
\nabla \tilde f(z) = F^T \nabla f(Fz + \hat x), \qquad \nabla^2 \tilde f(z) = F^T \nabla^2 f(Fz + \hat x) F.
$$

From the Hessian expression, we see that the Newton step for the equality constrained
problem is defined, i.e., the KKT matrix

$$
\begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix}
$$

is invertible, if and only if the Newton step for the reduced problem is defined, i.e.,
$\nabla^2 \tilde f(z)$ is invertible.

The Newton step for the reduced problem is

$$
\Delta z_{\rm nt} = -\nabla^2 \tilde f(z)^{-1} \nabla \tilde f(z) = -(F^T \nabla^2 f(x) F)^{-1} F^T \nabla f(x), \tag{10.15}
$$

where $x = Fz + \hat x$. This search direction for the reduced problem corresponds to
the direction

$$
F \Delta z_{\rm nt} = -F (F^T \nabla^2 f(x) F)^{-1} F^T \nabla f(x)
$$

for the original, equality constrained problem. We claim this is precisely the same
as the Newton direction $\Delta x_{\rm nt}$ for the original problem, defined in (10.11).

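To show this, we take $\Delta x_{\rm nt} = F \Delta z_{\rm nt}$, choose

$$
w = -(AA^T)^{-1} A \left( \nabla f(x) + \nabla^2 f(x) \Delta x_{\rm nt} \right),
$$

and verify that the equations defining the Newton step,

$$
\nabla^2 f(x) \Delta x_{\rm nt} + A^T w + \nabla f(x) = 0, \qquad A \Delta x_{\rm nt} = 0, \tag{10.16}
$$

hold. The second equation, $A \Delta x_{\rm nt} = 0$, is satisfied because $AF = 0$. To verify the
first equation, we observe that

$$
\begin{bmatrix} F^T \\ A \end{bmatrix}
\left( \nabla^2 f(x) \Delta x_{\rm nt} + A^T w + \nabla f(x) \right)
=
\begin{bmatrix}
F^T \nabla^2 f(x) \Delta x_{\rm nt} + F^T A^T w + F^T \nabla f(x) \\
A \nabla^2 f(x) \Delta x_{\rm nt} + A A^T w + A \nabla f(x)
\end{bmatrix}
= 0.
$$

Since the matrix on the left of the first line is nonsingular, we conclude that (10.16)
holds.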

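In a similar way, the Newton decrement $\tilde \lambda(z)$ of $\tilde f$ at $z$ and the Newton decrement
of $f$ at $x$ turn out to be equal:

$$
\begin{aligned}
\tilde \lambda(z)^2 &= \Delta z_{\rm nt}^T \nabla^2 \tilde f(z) \Delta z_{\rm nt} \\
&= \Delta z_{\rm nt}^T F^T \nabla^2 f(x) F \Delta z_{\rm nt} \\
&= \Delta x_{\rm nt}^T \nabla^2 f(x) \Delta x_{\rm nt} \\
&= \lambda(x)^2.
\end{aligned}
$$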
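The equivalence between the two Newton steps is easy to check numerically. The sketch below is an illustration only: the test data are random, the objective $f(x) = -\sum_i \log x_i$ is chosen here for convenience, and $F$ is built from the SVD of $A$ (one of many valid choices of a nullspace basis).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 7, 3
A = rng.standard_normal((p, n))
x = rng.uniform(0.5, 2.0, n)             # a point in dom f for f(x) = -sum(log x_i)
g = -1.0 / x                             # gradient of f at x
H = np.diag(1.0 / x**2)                  # Hessian of f at x

# Columns of F span the nullspace of A: the last n - p right singular vectors.
_, _, Vt = np.linalg.svd(A)
F = Vt[p:].T

# Newton step of the reduced problem (10.15), mapped back to x-coordinates.
dz = -np.linalg.solve(F.T @ H @ F, F.T @ g)
dx_elim = F @ dz

# Newton step from the KKT system (10.11).
kkt = np.block([[H, A.T], [A, np.zeros((p, p))]])
sol = np.linalg.solve(kkt, np.concatenate([-g, np.zeros(p)]))
dx_kkt, w = sol[:n], sol[n:]

# The dual vector chosen in the proof, and the two Newton decrements.
w_formula = -np.linalg.solve(A @ A.T, A @ (g + H @ dx_kkt))
lam_x = np.sqrt(dx_kkt @ H @ dx_kkt)
lam_z = np.sqrt(dz @ (F.T @ H @ F) @ dz)
```

Any other full-rank $F$ with $\mathcal{R}(F) = \mathcal{N}(A)$ would give the same `dx_elim`, since the direction $F \Delta z_{\rm nt}$ does not depend on the particular basis.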


10.2.4  Convergence  analysis 

We saw above that applying Newton's method with equality constraints is exactly
the same as applying Newton's method to the reduced problem obtained by eliminating
the equality constraints. Everything we know about the convergence of Newton's
method for unconstrained problems therefore transfers to Newton's method
for equality constrained problems. In particular, the practical performance of Newton's
method with equality constraints is exactly like the performance of Newton's
method for unconstrained problems. Once $x^{(k)}$ is near $x^\star$, convergence is extremely
rapid, with a very high accuracy obtained in only a few iterations.

Assumptions 

We  make  the  following  assumptions. 
• The sublevel set $S = \{ x \mid x \in \mathbf{dom}\, f, \; f(x) \leq f(x^{(0)}), \; Ax = b \}$ is closed,
where $x^{(0)} \in \mathbf{dom}\, f$ satisfies $Ax^{(0)} = b$. This is the case if $f$ is closed
(see §A.3.3).

• On the set $S$, we have $\nabla^2 f(x) \preceq MI$, and

$$
\left\| \begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix}^{-1} \right\|_2 \leq K, \tag{10.17}
$$

i.e., the inverse of the KKT matrix is bounded on $S$. (Of course the inverse
must exist in order for the Newton step to be defined at each point in $S$.)

• For $x, \tilde x \in S$, $\nabla^2 f$ satisfies the Lipschitz condition
$\| \nabla^2 f(x) - \nabla^2 f(\tilde x) \|_2 \leq L \| x - \tilde x \|_2$.


Bounded  inverse  KKT  matrix  assumption 

The condition (10.17) plays the role of the strong convexity assumption in the
standard Newton method (§9.5.3, page 488). When there are no equality constraints,
(10.17) reduces to the condition $\| \nabla^2 f(x)^{-1} \|_2 \leq K$ on $S$, so we can take
$K = 1/m$, if $\nabla^2 f(x) \succeq mI$ on $S$, where $m > 0$. With equality constraints, the
condition is not as simple as a positive lower bound on the minimum eigenvalue.
Since the KKT matrix is symmetric, the condition (10.17) is that its eigenvalues,
$n$ of which are positive, and $p$ of which are negative, are bounded away from zero.


Analysis  via  the  eliminated  problem 

The assumptions above imply that the eliminated objective function $\tilde f$, together
with the associated initial point $z^{(0)}$, where $x^{(0)} = \hat x + F z^{(0)}$, satisfy the assumptions
required in the convergence analysis of Newton's method for unconstrained
problems, given in §9.5.3 (with different constants $\tilde m$, $\tilde M$, and $\tilde L$). It follows that
Newton's method with equality constraints converges to $x^\star$ (and $\nu^\star$ as well).

To show that the assumptions above imply that the eliminated problem satisfies
the assumptions for the unconstrained Newton method is mostly straightforward
(see exercise 10.4). Here we show the one implication that is tricky: that the
bounded inverse KKT condition, together with the upper bound $\nabla^2 f(x) \preceq MI$,
implies that $\nabla^2 \tilde f(z) \succeq mI$ for some positive constant $m$. More specifically we will
show that this inequality holds for

$$
m = \frac{\sigma_{\min}(F)^2}{K^2 M}, \tag{10.18}
$$

which is positive, since $F$ is full rank.

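We show this by contradiction. Suppose that $F^T H F \not\succeq mI$, where $H = \nabla^2 f(x)$.
Then we can find $u$, with $\|u\|_2 = 1$, such that $u^T F^T H F u < m$, i.e.,
$\| H^{1/2} F u \|_2 < m^{1/2}$. Using $AF = 0$, we have

$$
\begin{bmatrix} H & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} Fu \\ 0 \end{bmatrix}
=
\begin{bmatrix} HFu \\ 0 \end{bmatrix},
$$

and so

$$
\left\| \begin{bmatrix} H & A^T \\ A & 0 \end{bmatrix}^{-1} \right\|_2
\geq
\frac{ \left\| \begin{bmatrix} Fu \\ 0 \end{bmatrix} \right\|_2 }{ \left\| \begin{bmatrix} HFu \\ 0 \end{bmatrix} \right\|_2 }
=
\frac{\| Fu \|_2}{\| HFu \|_2}.
$$

Using $\| Fu \|_2 \geq \sigma_{\min}(F)$ and

$$
\| HFu \|_2 \leq \| H^{1/2} \|_2 \| H^{1/2} F u \|_2 < M^{1/2} m^{1/2},
$$

we conclude

$$
\left\| \begin{bmatrix} H & A^T \\ A & 0 \end{bmatrix}^{-1} \right\|_2
\geq \frac{\| Fu \|_2}{\| HFu \|_2}
> \frac{\sigma_{\min}(F)}{M^{1/2} m^{1/2}} = K,
$$

using our expression for $m$ given in (10.18). This contradicts the bound (10.17).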
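The bound (10.18) can also be checked at a single point, taking $K$ and $M$ to be the smallest constants valid at that point. This is only a numerical sanity check on assumed random data (the diagonal matrix `H` below is a stand-in for a Hessian, and the scaling applied to `F` is an arbitrary choice made here so that $\sigma_{\min}(F) \neq 1$); it is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 2
A = rng.standard_normal((p, n))
H = np.diag(rng.uniform(0.5, 4.0, n))    # stand-in for the Hessian at one point

# F: a full-rank matrix whose columns span the nullspace of A.
# The triangular factor has a positive diagonal, so F keeps rank n - p.
_, _, Vt = np.linalg.svd(A)
F = Vt[p:].T @ np.triu(rng.uniform(0.5, 1.5, (n - p, n - p)))

kkt = np.block([[H, A.T], [A, np.zeros((p, p))]])
K = np.linalg.norm(np.linalg.inv(kkt), 2)        # spectral norm of the inverse KKT matrix
M = np.linalg.eigvalsh(H)[-1]                    # largest eigenvalue of H
sigma_min = np.linalg.svd(F, compute_uv=False)[-1]

m = sigma_min**2 / (K**2 * M)                    # the constant in (10.18)
lam_min = np.linalg.eigvalsh(F.T @ H @ F)[0]     # smallest eigenvalue of the reduced Hessian
print(f"m = {m:.4g}, lambda_min(F^T H F) = {lam_min:.4g}")
```

In typical instances the bound is far from tight, which is consistent with its role as a worst-case constant in the analysis.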


Convergence  analysis  for  self-concordant  functions 

If $f$ is self-concordant, then so is $\tilde f(z) = f(Fz + \hat x)$. It follows that if $f$ is
self-concordant, we have the exact same complexity estimate as for unconstrained
problems: the number of iterations required to produce a solution within an accuracy
$\epsilon$ is no more than

$$
\frac{20 - 8\alpha}{\alpha \beta (1 - 2\alpha)^2} \left( f(x^{(0)}) - p^\star \right) + \log_2 \log_2 (1/\epsilon),
$$

where $\alpha$ and $\beta$ are the backtracking parameters (see (9.56)).


10.3  Infeasible  start  Newton  method 

Newton’s  method,  as  described  in  §10.2,  is  a  feasible  descent  method.  In  this 
section  we  describe  a  generalization  of  Newton’s  method  that  works  with  initial 
points,  and  iterates,  that  are  not  feasible. 


10.3.1  Newton  step  at  infeasible  points 

As in Newton's method, we start with the optimality conditions for the equality
constrained minimization problem:

$$
Ax^\star = b, \qquad \nabla f(x^\star) + A^T \nu^\star = 0.
$$

Let $x$ denote the current point, which we do not assume to be feasible, but we do
assume satisfies $x \in \mathbf{dom}\, f$. Our goal is to find a step $\Delta x$ so that $x + \Delta x$ satisfies
(at least approximately) the optimality conditions, i.e., $x + \Delta x \approx x^\star$. To do this
we substitute $x + \Delta x$ for $x^\star$ and $w$ for $\nu^\star$ in the optimality conditions, and use the
first-order approximation

$$
\nabla f(x + \Delta x) \approx \nabla f(x) + \nabla^2 f(x) \Delta x
$$

for the gradient to obtain

$$
A(x + \Delta x) = b, \qquad \nabla f(x) + \nabla^2 f(x) \Delta x + A^T w = 0.
$$

This is a set of linear equations for $\Delta x$ and $w$,

$$
\begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x \\ w \end{bmatrix}
=
- \begin{bmatrix} \nabla f(x) \\ Ax - b \end{bmatrix}. \tag{10.19}
$$


The equations are the same as the equations (10.11) that define the Newton step
at a feasible point $x$, with one difference: the second block component of the
righthand side contains $Ax - b$, which is the residual vector for the linear equality
constraints. When $x$ is feasible, the residual vanishes, and the equations (10.19)
reduce to the equations (10.11) that define the standard Newton step at a feasible
point $x$. Thus, if $x$ is feasible, the step $\Delta x$ defined by (10.19) coincides with the
Newton step described above (but defined only when $x$ is feasible). For this reason
we use the notation $\Delta x_{\rm nt}$ for the step $\Delta x$ defined by (10.19), and refer to it as the
Newton step at $x$, with no confusion.


Interpretation  as  primal-dual  Newton  step 

We can give an interpretation of the equations (10.19) in terms of a primal-dual
method for the equality constrained problem. By a primal-dual method, we mean
one in which we update both the primal variable $x$, and the dual variable $\nu$, in
order to (approximately) satisfy the optimality conditions.

We express the optimality conditions as $r(x^\star, \nu^\star) = 0$, where
$r : \mathbf{R}^n \times \mathbf{R}^p \to \mathbf{R}^n \times \mathbf{R}^p$ is defined as

$$
r(x, \nu) = (r_{\rm dual}(x, \nu), \; r_{\rm pri}(x, \nu)).
$$

Here

$$
r_{\rm dual}(x, \nu) = \nabla f(x) + A^T \nu, \qquad r_{\rm pri}(x, \nu) = Ax - b
$$

are the dual residual and primal residual, respectively. The first-order Taylor
approximation of $r$, near our current estimate $y$, is

$$
r(y + z) \approx \hat r(y + z) = r(y) + Dr(y) z,
$$

where $Dr(y) \in \mathbf{R}^{(n+p) \times (n+p)}$ is the derivative of $r$, evaluated at $y$ (see §A.4.1).

We define the primal-dual Newton step $\Delta y_{\rm pd}$ as the step $z$ for which the Taylor
approximation $\hat r(y + z)$ vanishes, i.e.,

$$
Dr(y) \Delta y_{\rm pd} = -r(y). \tag{10.20}
$$

Note that here we consider both $x$ and $\nu$ as variables; $\Delta y_{\rm pd} = (\Delta x_{\rm pd}, \Delta \nu_{\rm pd})$ gives
both a primal and a dual step.
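Evaluating the derivative of $r$, we can express (10.20) as

$$
\begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x_{\rm pd} \\ \Delta \nu_{\rm pd} \end{bmatrix}
=
- \begin{bmatrix} r_{\rm dual} \\ r_{\rm pri} \end{bmatrix}
=
- \begin{bmatrix} \nabla f(x) + A^T \nu \\ Ax - b \end{bmatrix}. \tag{10.21}
$$

Writing $\nu + \Delta \nu_{\rm pd}$ as $\nu^+$, we can express this as

$$
\begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x_{\rm pd} \\ \nu^+ \end{bmatrix}
=
- \begin{bmatrix} \nabla f(x) \\ Ax - b \end{bmatrix}, \tag{10.22}
$$

which is exactly the same set of equations as (10.19). The solutions of (10.19),
(10.21), and (10.22) are therefore related as

$$
\Delta x_{\rm nt} = \Delta x_{\rm pd}, \qquad w = \nu^+ = \nu + \Delta \nu_{\rm pd}.
$$

This shows that the (infeasible) Newton step is the same as the primal part of
the primal-dual step, and the associated dual vector $w$ is the updated primal-dual
variable $\nu^+ = \nu + \Delta \nu_{\rm pd}$.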

The  two  expressions  for  the  Newton  step  and  dual  variable  (or  dual  step),  given 
by  (10.21)  and  (10.22),  are  of  course  equivalent,  but  each  reveals  a  different  feature 
of  the  Newton  step.  The  equation  (10.21)  shows  that  the  Newton  step  and  the 
associated  dual  step  are  obtained  by  solving  a  set  of  equations,  with  the  primal 
and  dual  residuals  as  the  righthand  side.  The  equation  (10.22),  which  is  how  we 
originally  defined  the  Newton  step,  gives  the  Newton  step  and  the  updated  dual 
variable,  and  shows  that  the  current  value  of  the  dual  variable  is  not  needed  to 
compute  the  primal  step,  or  the  updated  value  of  the  dual  variable. 
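The relation between the two forms can be confirmed numerically. The following sketch uses assumed random data and, purely as an example, the objective $f(x) = -\sum_i \log x_i$; it solves the residual form (10.21) and the form (10.22) with the same KKT matrix and compares the results.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 6, 2
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)
x = rng.uniform(0.5, 2.0, n)             # in dom f, but generally A x != b
nu = rng.standard_normal(p)              # arbitrary current dual estimate

g = -1.0 / x                             # gradient of f(x) = -sum(log x_i)
H = np.diag(1.0 / x**2)                  # Hessian
kkt = np.block([[H, A.T], [A, np.zeros((p, p))]])

# Residual form (10.21): righthand side is minus (r_dual, r_pri).
r_dual = g + A.T @ nu
r_pri = A @ x - b
sol_pd = np.linalg.solve(kkt, -np.concatenate([r_dual, r_pri]))
dx_pd, dnu_pd = sol_pd[:n], sol_pd[n:]

# Form (10.22), identical to (10.19): the current nu does not appear.
sol_nt = np.linalg.solve(kkt, -np.concatenate([g, r_pri]))
dx_nt, w = sol_nt[:n], sol_nt[n:]
```

Changing `nu` changes `dnu_pd` but leaves `dx_nt` and `w` unchanged, which is the point made in the paragraph above.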


Residual  norm  reduction  property 


The Newton direction, at an infeasible point, is not necessarily a descent direction
for $f$. From (10.19), we note that

$$
\begin{aligned}
\left. \frac{d}{dt} f(x + t \Delta x) \right|_{t=0}
&= \nabla f(x)^T \Delta x \\
&= -\Delta x^T \left( \nabla^2 f(x) \Delta x + A^T w \right) \\
&= -\Delta x^T \nabla^2 f(x) \Delta x + (Ax - b)^T w,
\end{aligned}
$$

which is not necessarily negative (unless, of course, $x$ is feasible, i.e., $Ax = b$). The
primal-dual interpretation, however, shows that the norm of the residual decreases
in the Newton direction, i.e.,

$$
\left. \frac{d}{dt} \| r(y + t \Delta y_{\rm pd}) \|_2^2 \right|_{t=0}
= 2 r(y)^T Dr(y) \Delta y_{\rm pd} = -2 r(y)^T r(y).
$$

Applying the chain rule to the square, we obtain

$$
\left. \frac{d}{dt} \| r(y + t \Delta y_{\rm pd}) \|_2 \right|_{t=0} = -\| r(y) \|_2. \tag{10.23}
$$

This allows us to use $\| r \|_2$ to measure the progress of the infeasible start Newton
method, for example, in the line search. (For the standard Newton method, we use
the function value $f$ to measure progress of the algorithm, at least until quadratic
convergence is attained.)
Full step feasibility property

The Newton step $\Delta x_{\rm nt}$ defined by (10.19) has the property (by construction) that

$$
A(x + \Delta x_{\rm nt}) = b. \tag{10.24}
$$

It follows that, if a step length of one is taken using the Newton step $\Delta x_{\rm nt}$, the
following iterate will be feasible. Once $x$ is feasible, the Newton step becomes a
feasible direction, so all future iterates will be feasible, regardless of the step sizes
taken.

More generally, we can analyze the effect of a damped step on the equality
constraint residual $r_{\rm pri}$. With a step length $t \in [0, 1]$, the next iterate is
$x^+ = x + t \Delta x_{\rm nt}$, so the equality constraint residual at the next iterate is

$$
r_{\rm pri}^+ = A(x + t \Delta x_{\rm nt}) - b = (1 - t)(Ax - b) = (1 - t) r_{\rm pri},
$$

using (10.24). Thus, a damped step, with length $t$, causes the residual to be scaled
down by a factor $1 - t$. Now suppose that we have $x^{(i+1)} = x^{(i)} + t^{(i)} \Delta x_{\rm nt}^{(i)}$, for
$i = 0, \ldots, k - 1$, where $\Delta x_{\rm nt}^{(i)}$ is the Newton step at the point $x^{(i)} \in \mathbf{dom}\, f$, and
$t^{(i)} \in [0, 1]$. Then we have

$$
r^{(k)} = \left( \prod_{i=0}^{k-1} (1 - t^{(i)}) \right) r^{(0)},
$$

where $r^{(i)} = Ax^{(i)} - b$ is the residual of $x^{(i)}$. This formula shows that the primal
residual at each step is in the direction of the initial primal residual, and is scaled
down at each step. It also shows that once a full step is taken, all future iterates
are primal feasible.
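The product formula for the primal residual can be verified with a few damped steps. The sketch below is illustrative only: it uses assumed random data, arbitrary step lengths, and the objective $f(x) = \sum_i e^{x_i}$, chosen here because its domain is all of $\mathbf{R}^n$, so no domain check is needed.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 2
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)
x = rng.standard_normal(n)               # infeasible starting point

steps = [0.3, 0.5, 0.2]                  # arbitrary step lengths t^(i) in [0, 1]
r0 = A @ x - b                           # initial primal residual r^(0)

for t in steps:
    g = np.exp(x)                        # gradient of f(x) = sum(exp(x_i))
    H = np.diag(np.exp(x))               # Hessian
    kkt = np.block([[H, A.T], [A, np.zeros((p, p))]])
    # Newton step at an infeasible point, from (10.19).
    dx = np.linalg.solve(kkt, -np.concatenate([g, A @ x - b]))[:n]
    x = x + t * dx

rk = A @ x - b                           # residual after k damped steps
predicted = np.prod([1 - t for t in steps]) * r0
```

Replacing any entry of `steps` by `1.0` makes `predicted`, and hence the primal residual, vanish from that iteration on.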


10.3.2  Infeasible  start  Newton  method 

We can develop an extension of Newton's method, using the Newton step $\Delta x_{\rm nt}$
defined by (10.19), with $x^{(0)} \in \mathbf{dom}\, f$, but not necessarily satisfying $Ax^{(0)} = b$.
We also use the dual part of the Newton step: $\Delta \nu_{\rm nt} = w - \nu$ in the notation
of (10.19), or equivalently, $\Delta \nu_{\rm nt} = \Delta \nu_{\rm pd}$ in the notation of (10.21).


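Algorithm 10.2 Infeasible start Newton method.

given starting point $x \in \mathbf{dom}\, f$, $\nu$, tolerance $\epsilon > 0$, $\alpha \in (0, 1/2)$, $\beta \in (0, 1)$.

repeat

1. Compute primal and dual Newton steps $\Delta x_{\rm nt}$, $\Delta \nu_{\rm nt}$.

2. Backtracking line search on $\| r \|_2$.

   $t := 1$.

   while $\| r(x + t \Delta x_{\rm nt}, \nu + t \Delta \nu_{\rm nt}) \|_2 > (1 - \alpha t) \| r(x, \nu) \|_2$, $\; t := \beta t$.

3. Update. $x := x + t \Delta x_{\rm nt}$, $\nu := \nu + t \Delta \nu_{\rm nt}$.

until $Ax = b$ and $\| r(x, \nu) \|_2 \leq \epsilon$.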
This algorithm is very similar to the standard Newton method with feasible starting
point, with a few exceptions. First, the search directions include the extra
correction terms that depend on the primal residual. Second, the line search is
carried out using the norm of the residual, instead of the function value $f$. Finally,
the algorithm terminates when primal feasibility has been achieved, and the norm
of the (dual) residual is small.
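A compact sketch of Algorithm 10.2 is given below. Several details are assumptions made for this illustration and are not prescribed by the text: the function name `infeasible_newton`, the `in_dom` callback used to keep iterates in $\mathbf{dom}\, f$ during backtracking, the default parameters, and the exit test, which in floating point checks only $\| r \|_2 \leq \epsilon$ (this bounds the primal residual as well) instead of requiring $Ax = b$ exactly.

```python
import numpy as np

def infeasible_newton(grad, hess, in_dom, A, b, x0, nu0,
                      eps=1e-9, alpha=0.25, beta=0.5, max_iter=100):
    """Infeasible start Newton method, following the outline of Algorithm 10.2.

    Hypothetical helper: x0 must lie in dom f but need not satisfy A x0 = b.
    """
    x, nu = x0.astype(float).copy(), nu0.astype(float).copy()
    n, p = x.size, b.size

    def r(x, nu):
        # Stacked residual (r_dual, r_pri).
        return np.concatenate([grad(x) + A.T @ nu, A @ x - b])

    for _ in range(max_iter):
        res = r(x, nu)
        norm_res = np.linalg.norm(res)
        if norm_res <= eps:              # floating-point stand-in for the exit test
            break
        # 1. Primal and dual Newton steps from (10.21).
        kkt = np.block([[hess(x), A.T], [A, np.zeros((p, p))]])
        step = np.linalg.solve(kkt, -res)
        dx, dnu = step[:n], step[n:]
        # 2. Backtracking line search on the residual norm, staying inside dom f.
        t = 1.0
        while (not in_dom(x + t * dx)) or \
                np.linalg.norm(r(x + t * dx, nu + t * dnu)) > (1 - alpha * t) * norm_res:
            t *= beta
        # 3. Update.
        x, nu = x + t * dx, nu + t * dnu
    return x, nu

# Analytic centering with an infeasible positive starting point.
rng = np.random.default_rng(0)
p, n = 3, 8
A = rng.standard_normal((p, n))
A[0] = rng.uniform(0.5, 1.5, n)          # positive row keeps the feasible set bounded
b = A @ rng.uniform(0.5, 1.5, n)         # ensures a strictly feasible point exists

x_opt, nu_opt = infeasible_newton(
    grad=lambda x: -1.0 / x,
    hess=lambda x: np.diag(1.0 / x**2),
    in_dom=lambda x: np.all(x > 0),
    A=A, b=b, x0=np.ones(n), nu0=np.zeros(p))
print("residual norm:", np.linalg.norm(np.concatenate([-1 / x_opt + A.T @ nu_opt, A @ x_opt - b])))
```

The starting point `np.ones(n)` is positive but in general does not satisfy $Ax = b$; feasibility is reached the first time the line search accepts $t = 1$.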

The  line  search  in  step  2  deserves  some  comment.  Using  the  norm  of  the  residual 
in  the  line  search  can  increase  the  cost,  compared  to  a  line  search  based  on  the 
function  value,  but  the  increase  is  usually  negligible.  Also,  we  note  that  the  line 
search  must  terminate  in  a  finite  number  of  steps,  since  (10.23)  shows  that  the  line 
search  exit  condition  is  satisfied  for  small  t. 

The  equation  (10.24)  shows  that  if  at  some  iteration  the  step  length  is  chosen  to 
be  one,  the  next  iterate  will  be  feasible.  Thereafter,  all  iterates  will  be  feasible,  and 
therefore  the  search  direction  for  the  infeasible  start  Newton  method  coincides,  once 
a  feasible  iterate  is  obtained,  with  the  search  direction  for  the  (feasible)  Newton 
method  described  in  §10.2. 

There are many variations on the infeasible start Newton method. For example,
we can switch to the (feasible) Newton method described in §10.2 once feasibility
is achieved. (In other words, we change the line search to one based on $f$, and
terminate when $\lambda(x)^2/2 \leq \epsilon$.) Once feasibility is achieved, the infeasible start and
the standard (feasible) Newton method differ only in the backtracking and exit
conditions, and have very similar performance.

Using  infeasible  start  Newton  method  to  simplify  initialization 

The main advantage of the infeasible start Newton method is in the initialization
required. If $\mathbf{dom}\, f = \mathbf{R}^n$, then initializing the (feasible) Newton method simply
requires computing a solution to $Ax = b$, and there is no particular advantage,
other than convenience, in using the infeasible start Newton method.

When $\mathbf{dom}\, f$ is not all of $\mathbf{R}^n$, finding a point in $\mathbf{dom}\, f$ that satisfies $Ax = b$
can itself be a challenge. One general approach, probably the best when $\mathbf{dom}\, f$ is
complex and not known to intersect $\{ z \mid Az = b \}$, is to use a phase I method
(described in §11.4) to compute such a point (or verify that $\mathbf{dom}\, f$ does not intersect
$\{ z \mid Az = b \}$). But when $\mathbf{dom}\, f$ is relatively simple, and known to contain a point
satisfying $Ax = b$, the infeasible start Newton method gives a simple alternative.

One common example occurs when $\mathbf{dom}\, f = \mathbf{R}^n_{++}$, as in the equality constrained
analytic centering problem described in example 10.2. To initialize Newton's
method for the problem

$$
\begin{array}{ll}
\mbox{minimize} & -\sum_{i=1}^n \log x_i \\
\mbox{subject to} & Ax = b,
\end{array} \tag{10.25}
$$

requires finding a point $x^{(0)} \succ 0$ with $Ax^{(0)} = b$, which is equivalent to solving a
standard form LP feasibility problem. This can be carried out using a phase I method,
or alternatively, using the infeasible start Newton method, with any positive initial
point, e.g., $x^{(0)} = \mathbf{1}$.

The same trick can be used to initialize unconstrained problems where a starting
point in $\mathbf{dom}\, f$ is not known. As an example, we consider the dual of the equality
constrained analytic centering problem (10.25),

$$
\mbox{maximize} \quad g(\nu) = -b^T \nu + n + \sum_{i=1}^n \log (A^T \nu)_i.
$$

To initialize this problem for the (feasible start) Newton method, we must find a
point $\nu^{(0)}$ that satisfies $A^T \nu^{(0)} \succ 0$, i.e., we must solve a set of linear inequalities.
This can be done using a phase I method, or using an infeasible start Newton
method, after reformulating the problem. We first express it as an equality constrained
problem,

$$
\begin{array}{ll}
\mbox{maximize} & -b^T \nu + n + \sum_{i=1}^n \log y_i \\
\mbox{subject to} & y = A^T \nu,
\end{array}
$$

with new variable $y \in \mathbf{R}^n$. We can now use the infeasible start Newton method,
starting with any positive $y^{(0)}$ (and any $\nu^{(0)}$).

The disadvantage of using the infeasible start Newton method to initialize problems
for which a strictly feasible starting point is not known is that there is no clear
way to detect that there exists no strictly feasible point; the norm of the residual
will simply converge, slowly, to some positive value. (Phase I methods, in contrast,
can determine this fact unambiguously.) In addition, the convergence of the infeasible
start Newton method, before feasibility is achieved, can be slow; see §11.4.2.

10.3.3  Convergence  analysis 

In  this  section  we  show  that  the  infeasible  start  Newton  method  converges  to  the 
optimal  point,  provided  certain  assumptions  hold.  The  convergence  proof  is  very 
similar  to  those  for  the  standard  Newton  method,  or  the  standard  Newton  method 
with  equality  constraints.  We  show  that  once  the  norm  of  the  residual  is  small 
enough,  the  algorithm  takes  full  steps  (which  implies  that  feasibility  is  achieved), 
and  convergence  is  subsequently  quadratic.  We  also  show  that  the  norm  of  the 
residual  is  reduced  by  at  least  a  fixed  amount  in  each  iteration  before  the  region 
of  quadratic  convergence  is  reached.  Since  the  norm  of  the  residual  cannot  be 
negative,  this  shows  that  within  a  finite  number  of  steps,  the  residual  will  be  small 
enough  to  guarantee  full  steps,  and  quadratic  convergence. 

Assumptions 

We  make  the  following  assumptions. 

• The sublevel set

$$
S = \{ (x, \nu) \mid x \in \mathbf{dom}\, f, \; \| r(x, \nu) \|_2 \leq \| r(x^{(0)}, \nu^{(0)}) \|_2 \} \tag{10.26}
$$

is closed. If $f$ is closed, then $\| r \|_2$ is a closed function, and therefore this
condition is satisfied for any $x^{(0)} \in \mathbf{dom}\, f$ and any $\nu^{(0)} \in \mathbf{R}^p$ (see exercise 10.7).

• On the set $S$, we have

$$
\| Dr(x, \nu)^{-1} \|_2 = \left\| \begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix}^{-1} \right\|_2 \leq K \tag{10.27}
$$

for some $K$.

• For $(x, \nu), (\tilde x, \tilde \nu) \in S$, $Dr$ satisfies the Lipschitz condition

$$
\| Dr(x, \nu) - Dr(\tilde x, \tilde \nu) \|_2 \leq L \| (x, \nu) - (\tilde x, \tilde \nu) \|_2.
$$

(This is equivalent to $\nabla^2 f(x)$ satisfying a Lipschitz condition; see
exercise 10.7.)

As we will see below, these assumptions imply that $\mathbf{dom}\, f$ and $\{ z \mid Az = b \}$
intersect, and that there is an optimal point $(x^\star, \nu^\star)$.

Comparison  with  standard  Newton  method 

The assumptions above are very similar to the ones made in §10.2.4 (page 529)
for the analysis of the standard Newton method. The second and third assumptions,
the bounded inverse KKT matrix and Lipschitz condition, are essentially the
same. The sublevel set condition (10.26) for the infeasible start Newton method
is, however, more general than the sublevel set condition made in §10.2.4.

As an example, consider the equality constrained maximum entropy problem

$$
\begin{array}{ll}
\mbox{minimize} & f(x) = \sum_{i=1}^n x_i \log x_i \\
\mbox{subject to} & Ax = b,
\end{array}
$$

with $\mathbf{dom}\, f = \mathbf{R}^n_{++}$. The objective $f$ is not closed; it has sublevel sets that are not
closed, so the assumptions made in the standard Newton method may not hold,
at least for some initial points. The problem here is that the negative entropy
function does not converge to $\infty$ as $x_i \to 0$. On the other hand the sublevel set
condition (10.26) for the infeasible start Newton method does hold for this problem,
since the norm of the gradient of the negative entropy function does converge to
$\infty$ as $x_i \to 0$. Thus, the infeasible start Newton method is guaranteed to solve the
equality constrained maximum entropy problem. (We do not know whether the
standard Newton method can fail for this problem; we are only observing here that
our convergence analysis does not hold.) Note that if the initial point satisfies the
equality constraints, the only difference between the standard and infeasible start
Newton methods is in the line searches, which differ only during the damped stage.

A  basic  inequality 

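We start by deriving a basic inequality. Let $y = (x, \nu) \in S$ with $\| r(y) \|_2 \neq 0$, and
let $\Delta y_{\rm nt} = (\Delta x_{\rm nt}, \Delta \nu_{\rm nt})$ be the Newton step at $y$. Define

$$
t_{\max} = \inf \{ t > 0 \mid y + t \Delta y_{\rm nt} \notin S \}.
$$

If $y + t \Delta y_{\rm nt} \in S$ for all $t \geq 0$, we follow the usual convention and define $t_{\max} = \infty$.
Otherwise, $t_{\max}$ is the smallest positive value of $t$ such that
$\| r(y + t \Delta y_{\rm nt}) \|_2 = \| r(y^{(0)}) \|_2$. In particular, it follows that $y + t \Delta y_{\rm nt} \in S$ for $0 \leq t \leq t_{\max}$.

We will show that

$$
\| r(y + t \Delta y_{\rm nt}) \|_2 \leq (1 - t) \| r(y) \|_2 + (K^2 L / 2) t^2 \| r(y) \|_2^2 \tag{10.28}
$$

for $0 \leq t \leq \min \{ 1, t_{\max} \}$.

We have

$$
\begin{aligned}
r(y + t \Delta y_{\rm nt})
&= r(y) + \int_0^1 Dr(y + \tau t \Delta y_{\rm nt}) \, t \Delta y_{\rm nt} \, d\tau \\
&= r(y) + t Dr(y) \Delta y_{\rm nt} + \int_0^1 \left( Dr(y + \tau t \Delta y_{\rm nt}) - Dr(y) \right) t \Delta y_{\rm nt} \, d\tau \\
&= r(y) + t Dr(y) \Delta y_{\rm nt} + e \\
&= (1 - t) r(y) + e,
\end{aligned}
$$

using $Dr(y) \Delta y_{\rm nt} = -r(y)$, and defining

$$
e = \int_0^1 \left( Dr(y + \tau t \Delta y_{\rm nt}) - Dr(y) \right) t \Delta y_{\rm nt} \, d\tau.
$$

Now suppose $0 \leq t \leq t_{\max}$, so $y + \tau t \Delta y_{\rm nt} \in S$ for $0 \leq \tau \leq 1$. We can bound $\| e \|_2$
as follows:

$$
\begin{aligned}
\| e \|_2
&\leq \| t \Delta y_{\rm nt} \|_2 \int_0^1 \| Dr(y + \tau t \Delta y_{\rm nt}) - Dr(y) \|_2 \, d\tau \\
&\leq \| t \Delta y_{\rm nt} \|_2 \int_0^1 L \| \tau t \Delta y_{\rm nt} \|_2 \, d\tau \\
&= (L/2) t^2 \| \Delta y_{\rm nt} \|_2^2 \\
&= (L/2) t^2 \| Dr(y)^{-1} r(y) \|_2^2 \\
&\leq (K^2 L / 2) t^2 \| r(y) \|_2^2,
\end{aligned}
$$

using the Lipschitz condition on the second line, and the bound $\| Dr(y)^{-1} \|_2 \leq K$
on the last. Now we can derive the bound (10.28): for $0 \leq t \leq \min \{ 1, t_{\max} \}$,

$$
\begin{aligned}
\| r(y + t \Delta y_{\rm nt}) \|_2
&= \| (1 - t) r(y) + e \|_2 \\
&\leq (1 - t) \| r(y) \|_2 + \| e \|_2 \\
&\leq (1 - t) \| r(y) \|_2 + (K^2 L / 2) t^2 \| r(y) \|_2^2.
\end{aligned}
$$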

Damped  Newton  phase 

We first show that if $\|r(y)\|_2 > 1/(K^2L)$, one iteration of the infeasible start Newton method reduces $\|r\|_2$ by at least a certain minimum amount.

The righthand side of the basic inequality (10.28) is quadratic in $t$, and monotonically decreasing between $t = 0$ and its minimizer

$$\bar t = \frac{1}{K^2L\|r(y)\|_2} < 1.$$

We must have $t_{\max} > \bar t$, because the opposite would imply $\|r(y + t_{\max}\Delta y_{\rm nt})\|_2 < \|r(y)\|_2$, which is false. The basic inequality is therefore valid at $t = \bar t$, and therefore

$$\begin{aligned}
\|r(y + \bar t\Delta y_{\rm nt})\|_2 &\le \|r(y)\|_2 - 1/(2K^2L) \\
&\le \|r(y)\|_2 - \alpha/(K^2L) \\
&= (1 - \alpha\bar t)\|r(y)\|_2,
\end{aligned}$$

which shows that the step length $\bar t$ satisfies the line search exit condition. Therefore we have $t \ge \beta\bar t$, where $t$ is the step length chosen by the backtracking algorithm. From $t \ge \beta\bar t$ we have (from the exit condition in the backtracking line search)


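$$\begin{aligned}
\|r(y + t\Delta y_{\rm nt})\|_2 &\le (1 - \alpha t)\|r(y)\|_2 \\
&\le (1 - \alpha\beta\bar t)\|r(y)\|_2 \\
&= \left(1 - \frac{\alpha\beta}{K^2L\|r(y)\|_2}\right)\|r(y)\|_2 \\
&= \|r(y)\|_2 - \frac{\alpha\beta}{K^2L}.
\end{aligned}$$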


Thus, as long as we have $\|r(y)\|_2 > 1/(K^2L)$, we obtain a minimum decrease in $\|r\|_2$, per iteration, of $\alpha\beta/(K^2L)$. It follows that a maximum of

$$\frac{\|r(y^{(0)})\|_2 K^2L}{\alpha\beta}$$

iterations can be taken before we have $\|r(y^{(k)})\|_2 \le 1/(K^2L)$.


Quadratically  convergent  phase 

Now suppose $\|r(y)\|_2 \le 1/(K^2L)$. The basic inequality gives

$$\|r(y + t\Delta y_{\rm nt})\|_2 \le \bigl(1 - t + (1/2)t^2\bigr)\|r(y)\|_2 \qquad (10.29)$$

for $0 \le t \le \min\{1, t_{\max}\}$. We must have $t_{\max} > 1$, because otherwise it would follow from (10.29) that $\|r(y + t_{\max}\Delta y_{\rm nt})\|_2 < \|r(y)\|_2$, which contradicts the definition of $t_{\max}$. The inequality (10.29) therefore holds with $t = 1$, i.e., we have

$$\|r(y + \Delta y_{\rm nt})\|_2 \le (1/2)\|r(y)\|_2 \le (1 - \alpha)\|r(y)\|_2.$$

This shows that the backtracking line search exit criterion is satisfied for $t = 1$, so a full step will be taken. Moreover, for all future iterations we have $\|r(y)\|_2 \le 1/(K^2L)$, so a full step will be taken for all following iterations.

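We can write the inequality (10.28) (for $t = 1$) as

$$\frac{K^2L\|r(y^+)\|_2}{2} \le \left(\frac{K^2L\|r(y)\|_2}{2}\right)^2,$$

where $y^+ = y + \Delta y_{\rm nt}$. Therefore, if $r(y^{+k})$ denotes the residual $k$ steps after an iteration in which $\|r(y)\|_2 \le 1/(K^2L)$, we have

$$\frac{K^2L\|r(y^{+k})\|_2}{2} \le \left(\frac{K^2L\|r(y)\|_2}{2}\right)^{2^k} \le \left(\frac{1}{2}\right)^{2^k},$$

i.e., we have quadratic convergence of $\|r(y)\|_2$ to zero.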

To show that the sequence of iterates converges, we will show that it is a Cauchy sequence. Suppose $y$ is an iterate satisfying $\|r(y)\|_2 \le 1/(K^2L)$, and $y^{+k}$ denotes the $k$th iterate after $y$. Since these iterates are in the region of quadratic convergence, the step size is one, so we have

$$\begin{aligned}
\|y^{+k} - y\|_2 &\le \|y^{+k} - y^{+(k-1)}\|_2 + \cdots + \|y^+ - y\|_2 \\
&= \|Dr(y^{+(k-1)})^{-1} r(y^{+(k-1)})\|_2 + \cdots + \|Dr(y)^{-1} r(y)\|_2 \\
&\le K\bigl(\|r(y^{+(k-1)})\|_2 + \cdots + \|r(y)\|_2\bigr) \\
&\le K\|r(y)\|_2 \sum_{i=0}^{k-1} \left(\frac{K^2L\|r(y)\|_2}{2}\right)^{2^i - 1} \\
&\le K\|r(y)\|_2 \sum_{i=0}^{k-1} \left(\frac{1}{2}\right)^{2^i - 1} \\
&\le 2K\|r(y)\|_2,
\end{aligned}$$

where in the third line we use the assumption that $\|Dr^{-1}\|_2 \le K$ for all iterates. Since $\|r(y^{(k)})\|_2$ converges to zero, we conclude $y^{(k)}$ is a Cauchy sequence, and therefore converges. By continuity of $r$, the limit point $y^\star$ satisfies $r(y^\star) = 0$. This establishes our earlier claim that the assumptions at the beginning of this section imply that there is an optimal point $(x^\star, \nu^\star)$.


10.3.4  Convex-concave  games 

The proof of convergence for the infeasible start Newton method reveals that the method can be used for a larger class of problems than equality constrained convex optimization problems. Suppose $r : \mathbf{R}^n \to \mathbf{R}^n$ is differentiable, its derivative satisfies a Lipschitz condition on $S$, and $\|Dr(x)^{-1}\|_2$ is bounded on $S$, where

$$S = \{x \in \operatorname{dom} r \mid \|r(x)\|_2 \le \|r(x^{(0)})\|_2\}$$

is a closed set. Then the infeasible start Newton method, started at $x^{(0)}$, converges to a solution of $r(x) = 0$ in $S$. In the infeasible start Newton method, we apply this to the specific case in which $r$ is the residual for the equality constrained convex optimization problem. But it applies in several other interesting cases. One interesting example is solving a convex-concave game. (See §5.4.3 and exercise 5.25 for discussion of other, related games.)

An unconstrained (zero-sum, two-player) game on $\mathbf{R}^p \times \mathbf{R}^q$ is defined by its payoff function $f : \mathbf{R}^{p+q} \to \mathbf{R}$. The meaning is that player 1 chooses a value (or move) $u \in \mathbf{R}^p$, and player 2 chooses a value (or move) $v \in \mathbf{R}^q$; based on these choices, player 1 makes a payment to player 2, in the amount $f(u, v)$. The goal of player 1 is to minimize this payment, while the goal of player 2 is to maximize it.

If player 1 makes his choice $u$ first, and player 2 knows the choice, then player 2 will choose $v$ to maximize $f(u, v)$, which results in a payoff of $\sup_v f(u, v)$ (assuming the supremum is achieved). If player 1 assumes that player 2 will make this choice, he should choose $u$ to minimize $\sup_v f(u, v)$. The resulting payoff, from player 1 to player 2, will then be

$$\inf_u \sup_v f(u, v) \qquad (10.30)$$

(assuming that the supremum is achieved). On the other hand if player 2 makes the first choice, the strategies are reversed, and the resulting payoff from player 1 to player 2 is

$$\sup_v \inf_u f(u, v). \qquad (10.31)$$

The payoff (10.30) is always greater than or equal to the payoff (10.31); the difference between the two payoffs can be interpreted as the advantage afforded the player who makes the second move, with knowledge of the other player's move. We say that $(u^\star, v^\star)$ is a solution of the game, or a saddle-point for the game, if for all $u$, $v$,

$$f(u^\star, v) \le f(u^\star, v^\star) \le f(u, v^\star).$$

When a solution exists, there is no advantage to making the second move; $f(u^\star, v^\star)$ is the common value of both payoffs (10.30) and (10.31). (See exercise 3.14.)

The game is called convex-concave if for each $v$, $f(u, v)$ is a convex function of $u$, and for each $u$, $f(u, v)$ is a concave function of $v$. When $f$ is differentiable (and convex-concave), a saddle-point for the game is characterized by $\nabla f(u^\star, v^\star) = 0$.


Solution  via  infeasible  start  Newton  method 

We can use the infeasible start Newton method to compute a solution of a convex-concave game with twice differentiable payoff function. We define the residual as

$$r(u, v) = \nabla f(u, v) = \begin{bmatrix} \nabla_u f(u, v) \\ \nabla_v f(u, v) \end{bmatrix},$$

and apply the infeasible start Newton method. In the context of games, the infeasible start Newton method is simply called Newton's method (for convex-concave games).

We can guarantee convergence of the (infeasible start) Newton method provided $Dr = \nabla^2 f$ has bounded inverse, and satisfies a Lipschitz condition on the sublevel set

$$S = \{(u, v) \in \operatorname{dom} f \mid \|r(u, v)\|_2 \le \|r(u^{(0)}, v^{(0)})\|_2\},$$

where $u^{(0)}$, $v^{(0)}$ are the starting players' choices.

There is a simple analog of the strong convexity condition in an unconstrained minimization problem. We say the game with payoff function $f$ is strongly convex-concave if for some $m > 0$, we have $\nabla^2_{uu} f(u, v) \succeq mI$ and $\nabla^2_{vv} f(u, v) \preceq -mI$, for all $(u, v) \in S$. Not surprisingly, this strong convex-concave assumption implies the bounded inverse condition (exercise 10.10).


10.3.5  Examples 

A  simple  example 

We illustrate the infeasible start Newton method on the equality constrained analytic center problem (10.25). Our first example is an instance with dimensions $n = 100$ and $m = 50$, generated randomly, for which the problem is feasible and bounded below. The infeasible start Newton method is used, with initial primal and dual points $x^{(0)} = \mathbf{1}$, $\nu^{(0)} = 0$, and backtracking parameters $\alpha = 0.01$ and $\beta = 0.5$. The plot in figure 10.1 shows the norms of the primal and dual residuals separately, versus iteration number, and the plot in figure 10.2 shows the step lengths. A full Newton step is taken in iteration 8, so the primal residual becomes (almost) zero, and remains (almost) zero. After around iteration 9 or so, the (dual) residual converges quadratically to zero.

An  infeasible  example 

We also consider a problem instance, of the same dimensions as the example above, for which $\operatorname{dom} f$ does not intersect $\{z \mid Az = b\}$, i.e., the problem is infeasible. (This violates the basic assumption in the chapter that problem (10.1) is solvable, as well as the assumptions made in §10.2.4; the example is meant only to show what happens to the infeasible start Newton method when $\operatorname{dom} f$ does not intersect $\{z \mid Az = b\}$.) The norm of the residual for this example is shown in figure 10.3, and the step length in figure 10.4. Here, of course, the step lengths are never one, and the residual does not converge to zero.

A  convex-concave  game 

Our final example involves a convex-concave game on $\mathbf{R}^{100} \times \mathbf{R}^{100}$, with payoff function

$$f(u, v) = u^TAv + b^Tu + c^Tv - \log(1 - u^Tu) + \log(1 - v^Tv), \qquad (10.32)$$

defined on

$$\operatorname{dom} f = \{(u, v) \mid u^Tu < 1,\ v^Tv < 1\}.$$

The problem data $A$, $b$, and $c$ were randomly generated. The progress of the (infeasible start) Newton method, started at $u^{(0)} = v^{(0)} = 0$, with backtracking parameters $\alpha = 0.01$ and $\beta = 0.5$, is shown in figure 10.5.
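As a rough illustration of Newton's method for a game with payoff (10.32), the following Python sketch runs the iteration on a small random instance. The function names, the problem size (5 by 5 rather than 100 by 100), the random data, and the stopping tolerance are assumptions made for this sketch, not the setup used for figure 10.5.

```python
import numpy as np

def game_residual_and_jacobian(u, v, A, b, c):
    """Return r = grad f and Dr = hess f for the payoff (10.32)."""
    su = 1.0 - u @ u                      # positive inside dom f
    sv = 1.0 - v @ v
    ru = A @ v + b + 2.0 * u / su         # gradient with respect to u
    rv = A.T @ u + c - 2.0 * v / sv       # gradient with respect to v
    Huu = 2.0 * np.eye(u.size) / su + 4.0 * np.outer(u, u) / su**2
    Hvv = -2.0 * np.eye(v.size) / sv - 4.0 * np.outer(v, v) / sv**2
    Dr = np.block([[Huu, A], [A.T, Hvv]])
    return np.concatenate([ru, rv]), Dr

def newton_game(A, b, c, alpha=0.01, beta=0.5, tol=1e-10, max_iter=50):
    p, q = A.shape
    u, v = np.zeros(p), np.zeros(q)       # start at u = v = 0
    history = []
    for _ in range(max_iter):
        r, Dr = game_residual_and_jacobian(u, v, A, b, c)
        norm_r = np.linalg.norm(r)
        history.append(norm_r)
        if norm_r <= tol:
            break
        step = np.linalg.solve(Dr, -r)    # Newton step: Dr * step = -r
        du, dv = step[:p], step[p:]
        t = 1.0
        while t > 1e-12:                  # backtracking on the residual norm
            un, vn = u + t * du, v + t * dv
            if un @ un < 1.0 and vn @ vn < 1.0:
                rn, _ = game_residual_and_jacobian(un, vn, A, b, c)
                if np.linalg.norm(rn) <= (1.0 - alpha * t) * norm_r:
                    break
            t *= beta
        u, v = un, vn
    return u, v, history

rng = np.random.default_rng(0)            # illustrative random data
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)
c = rng.standard_normal(5)
u, v, history = newton_game(A, b, c)
```

The list `history` holds the residual norms per iteration, which is the quantity one would plot to see the damped and quadratically convergent phases.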


10.4  Implementation 

10.4.1  Elimination 

To implement the elimination method, we have to calculate a full rank matrix $F$ and an $\hat x$ such that

$$\{x \mid Ax = b\} = \{Fz + \hat x \mid z \in \mathbf{R}^{n-p}\}.$$

Several methods for this are described in §C.5.

10.4.2  Solving  KKT  systems 

In this section we describe methods that can be used to compute the Newton step or infeasible Newton step, both of which involve solving a set of linear equations with KKT form

$$\begin{bmatrix} H & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} v \\ w \end{bmatrix} = -\begin{bmatrix} g \\ h \end{bmatrix}. \qquad (10.33)$$

Here we assume $H \in \mathbf{S}^n_+$, and $A \in \mathbf{R}^{p \times n}$ with $\operatorname{rank} A = p < n$. Similar methods can be used to compute the Newton step for a convex-concave game, in which the bottom right entry of the coefficient matrix is negative semidefinite (see exercise 10.13).

Figure 10.1 Progress of infeasible start Newton method on an equality constrained analytic centering problem with 100 variables and 50 constraints. The figure shows $\|r_{\rm pri}\|_2$ (solid line), and $\|r_{\rm dual}\|_2$ (dashed line). Note that feasibility is achieved (and maintained) after 8 iterations, and convergence is quadratic, starting from iteration 9 or so.

Figure 10.2 Step length versus iteration number for the same example problem. A full step is taken in iteration 8, which results in feasibility from iteration 8 on.

Figure 10.3 Progress of infeasible start Newton method on an equality constrained analytic centering problem with 100 variables and 50 constraints, for which $\operatorname{dom} f = \mathbf{R}^{100}_{++}$ does not intersect $\{z \mid Az = b\}$. The figure shows $\|r_{\rm pri}\|_2$ (solid line), and $\|r_{\rm dual}\|_2$ (dashed line). In this case, the residuals do not converge to zero.

Figure 10.4 Step length versus iteration number for the infeasible example problem. No full steps are taken, and the step lengths converge to zero.

Figure 10.5 Progress of (infeasible start) Newton method on a convex-concave game. Quadratic convergence becomes apparent after about 5 iterations.


Solving  full  KKT  system 

One straightforward approach is to simply solve the KKT system (10.33), which is a set of $n + p$ linear equations in $n + p$ variables. The KKT matrix is symmetric, but not positive definite, so a good way to do this is to use an $LDL^T$ factorization (see §C.3.3). If no structure of the matrix is exploited, the cost is $(1/3)(n + p)^3$ flops. This can be a reasonable approach when the problem is small (i.e., $n$ and $p$ are not too large), or when $A$ and $H$ are sparse.


Solving  KKT  system  via  elimination 

A method that is often better than directly solving the full KKT system is based on eliminating the variable $v$ (see §C.4). We start by describing the simplest case, in which $H \succ 0$. Starting from the first of the KKT equations

$$Hv + A^Tw = -g, \qquad Av = -h,$$

we solve for $v$ to obtain

$$v = -H^{-1}(g + A^Tw).$$

Substituting this into the second KKT equation yields $AH^{-1}(g + A^Tw) = h$, so we have

$$w = (AH^{-1}A^T)^{-1}(h - AH^{-1}g).$$

These formulas give us a method for computing $v$ and $w$.

The matrix appearing in the formula for $w$ is the Schur complement $S$ of $H$ in the KKT matrix:

$$S = -AH^{-1}A^T.$$

Because of the special structure of the KKT matrix, and our assumption that $A$ has rank $p$, the matrix $S$ is negative definite.


Algorithm 10.3 Solving KKT system by block elimination.

given KKT system with $H \succ 0$.

1. Form $H^{-1}A^T$ and $H^{-1}g$.

2. Form Schur complement $S = -AH^{-1}A^T$.

3. Determine $w$ by solving $Sw = AH^{-1}g - h$.

4. Determine $v$ by solving $Hv = -A^Tw - g$.
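The four steps of algorithm 10.3 can be sketched in Python as follows. This is a minimal dense illustration under stated assumptions: the function names and the random test data are invented for the sketch, and the helper `chol_solve` uses generic solves with the Cholesky factor for brevity, where a practical implementation would use triangular solves and exploit structure in $H$.

```python
import numpy as np

def chol_solve(L, B):
    """Solve (L L^T) X = B given a Cholesky factor L (generic solves, for brevity)."""
    return np.linalg.solve(L.T, np.linalg.solve(L, B))

def solve_kkt_block_elimination(H, A, g, h):
    """Solve [[H, A^T], [A, 0]] [v; w] = -[g; h], assuming H is positive definite."""
    L_H = np.linalg.cholesky(H)           # factor H once, reuse it for every solve
    HinvAT = chol_solve(L_H, A.T)         # step 1: H^{-1} A^T
    Hinvg = chol_solve(L_H, g)            #         H^{-1} g
    S = -A @ HinvAT                       # step 2: Schur complement (negative definite)
    L_S = np.linalg.cholesky(-S)          # step 3: Cholesky factorization of -S
    w = -chol_solve(L_S, A @ Hinvg - h)   #         gives S w = A H^{-1} g - h
    v = chol_solve(L_H, -A.T @ w - g)     # step 4: H v = -A^T w - g
    return v, w

rng = np.random.default_rng(0)            # illustrative data
n, p = 8, 3
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)               # a positive definite H
A = rng.standard_normal((p, n))
g, h = rng.standard_normal(n), rng.standard_normal(p)
v, w = solve_kkt_block_elimination(H, A, g, h)
```

The result can be compared against a direct solve of the full $(n+p) \times (n+p)$ KKT system, which is what the elimination is meant to avoid.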


Step 1 can be done by a Cholesky factorization of $H$, followed by $p + 1$ solves, which costs $f + (p + 1)s$, where $f$ is the cost of factoring $H$ and $s$ is the cost of an associated solve. Step 2 requires a $p \times n$ by $n \times p$ matrix multiplication. If we exploit no structure in this calculation, the cost is $p^2n$ flops. (Since the result is symmetric, we only need to compute the upper triangular part of $S$.) In some cases special structure in $A$ and $H$ can be exploited to carry out step 2 more efficiently. Step 3 can be carried out by Cholesky factorization of $-S$, which costs $(1/3)p^3$ flops if no further structure of $S$ is exploited. Step 4 can be carried out using the factorization of $H$ already calculated in step 1, so the cost is $2np + s$ flops. The total flop count, assuming that no structure is exploited in forming or factoring the Schur complement, is

$$f + ps + p^2n + (1/3)p^3$$

flops (keeping only dominant terms). If we exploit structure in forming or factoring $S$, the last two terms are even smaller.

If $H$ can be factored efficiently, then block elimination gives us a flop count advantage over directly solving the KKT system using an $LDL^T$ factorization. For example, if $H$ is diagonal (which corresponds to a separable objective function), we have $f = 0$ and $s = n$, so the total cost is $p^2n + (1/3)p^3$ flops, which grows only linearly with $n$. If $H$ is banded with bandwidth $k \ll n$, then $f = nk^2$, $s = 4nk$, so the total cost is around $nk^2 + 4nkp + p^2n + (1/3)p^3$ which still grows only linearly with $n$. Other structures of $H$ that can be exploited are block diagonal (which corresponds to block separable objective function), sparse, or diagonal plus low rank; see appendix C and §9.7 for more details and examples.


Example 10.3 Equality constrained analytic center. We consider the problem

$$\begin{array}{ll} \text{minimize} & -\sum_{i=1}^n \log x_i \\ \text{subject to} & Ax = b. \end{array}$$

Here the objective is separable, so the Hessian at $x$ is diagonal:

$$H = \operatorname{diag}(x_1^{-2}, \ldots, x_n^{-2}).$$

If we compute the Newton direction using a generic method such as an $LDL^T$ factorization of the KKT matrix, the cost is $(1/3)(n + p)^3$ flops.

If we compute the Newton step using block elimination, the cost is $np^2 + (1/3)p^3$ flops. This is much smaller than the cost of the generic method.

In fact this cost is the same as that of computing the Newton step for the dual problem, described in example 10.2 on page 525. For the (unconstrained) dual problem, the Hessian is

$$H_{\rm dual} = -ADA^T,$$

where $D$ is diagonal, with $D_{ii} = (A^T\nu)_i^{-2}$. Forming this matrix costs $np^2$ flops, and solving for the Newton step by a Cholesky factorization of $-H_{\rm dual}$ costs $(1/3)p^3$ flops.

Example 10.4 Minimum length piecewise-linear curve subject to equality constraints. We consider a piecewise-linear curve in $\mathbf{R}^2$ with knot points $(0, 0), (1, x_1), \ldots, (n, x_n)$. To find the minimum length curve that satisfies the equality constraints $Ax = b$, we form the problem

$$\begin{array}{ll} \text{minimize} & (1 + x_1^2)^{1/2} + \sum_{i=1}^{n-1} \bigl(1 + (x_{i+1} - x_i)^2\bigr)^{1/2} \\ \text{subject to} & Ax = b, \end{array}$$

with variable $x \in \mathbf{R}^n$, and $A \in \mathbf{R}^{p \times n}$. In this problem, the objective is a sum of functions of pairs of adjacent variables, so the Hessian $H$ is tridiagonal. Using block elimination, we can compute the Newton step in around $p^2n + (1/3)p^3$ flops.


Elimination  with  singular  H 

The block elimination method described above obviously does not work when $H$ is singular, but a simple variation on the method can be used in this more general case. The more general method is based on the following result: The KKT matrix is nonsingular if and only if $H + A^TQA \succ 0$ for some $Q \succeq 0$, in which case, $H + A^TQA \succ 0$ for all $Q \succ 0$. (See exercise 10.1.) We conclude, for example, that if the KKT matrix is nonsingular, then $H + A^TA \succ 0$.

Let $Q \succeq 0$ be a matrix for which $H + A^TQA \succ 0$. Then the KKT system (10.33) is equivalent to

$$\begin{bmatrix} H + A^TQA & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} v \\ w \end{bmatrix} = -\begin{bmatrix} g + A^TQh \\ h \end{bmatrix},$$

which can be solved using elimination since $H + A^TQA \succ 0$.
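The modification for singular $H$ can be sketched in a few lines of Python. The choice $Q = I$, the function name, and the small test instance are assumptions made for illustration; the sketch uses generic dense solves rather than a Cholesky factorization of $H + A^TQA$.

```python
import numpy as np

def solve_kkt_singular_H(H, A, g, h, Q=None):
    """Solve the KKT system (10.33) when H is only positive semidefinite."""
    p = A.shape[0]
    Q = np.eye(p) if Q is None else Q      # Q = I is one admissible choice
    H_mod = H + A.T @ Q @ A                # positive definite if the KKT matrix is nonsingular
    g_mod = g + A.T @ Q @ h                # matching change to the righthand side
    HinvAT = np.linalg.solve(H_mod, A.T)
    Hinvg = np.linalg.solve(H_mod, g_mod)
    w = np.linalg.solve(A @ HinvAT, h - A @ Hinvg)   # w = (A H^{-1} A^T)^{-1} (h - A H^{-1} g)
    v = -(Hinvg + HinvAT @ w)                        # v = -H^{-1} (g + A^T w)
    return v, w

rng = np.random.default_rng(0)             # illustrative data
H = np.diag([1.0, 1.0, 0.0, 0.0])          # singular, positive semidefinite
A = rng.standard_normal((2, 4))
g, h = rng.standard_normal(4), rng.standard_normal(2)
v, w = solve_kkt_singular_H(H, A, g, h)
```

The pair `(v, w)` solves the original system (10.33), since the two systems are equivalent whenever $Av = -h$ holds.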


10.4.3  Examples 

In this section we describe some longer examples, showing how structure can be exploited to efficiently compute the Newton step. We also include some numerical results.

Equality constrained analytic centering

We consider the equality constrained analytic centering problem

$$\begin{array}{ll} \text{minimize} & f(x) = -\sum_{i=1}^n \log x_i \\ \text{subject to} & Ax = b. \end{array}$$

(See examples 10.2 and 10.3.) We compare three methods, for a problem of size $p = 100$, $n = 500$.

The first method is Newton's method with equality constraints (§10.2). The Newton step $\Delta x_{\rm nt}$ is defined by the KKT system (10.11):

$$\begin{bmatrix} H & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta x_{\rm nt} \\ w \end{bmatrix} = \begin{bmatrix} -g \\ 0 \end{bmatrix},$$

where $H = \operatorname{diag}(1/x_1^2, \ldots, 1/x_n^2)$, and $g = -(1/x_1, \ldots, 1/x_n)$. As explained in example 10.3, page 546, the KKT system can be efficiently solved by elimination, i.e., by solving

$$AH^{-1}A^Tw = -AH^{-1}g,$$

and setting $\Delta x_{\rm nt} = -H^{-1}(A^Tw + g)$. In other words,

$$\Delta x_{\rm nt} = -\operatorname{diag}(x)^2A^Tw + x,$$

where $w$ is the solution of

$$A\operatorname{diag}(x)^2A^Tw = b. \qquad (10.34)$$

Figure 10.6 shows the error versus iteration. The different curves correspond to four different starting points. We use a backtracking line search with $\alpha = 0.1$, $\beta = 0.5$.

The second method is Newton's method applied to the dual

$$\text{maximize} \quad g(\nu) = -b^T\nu + \sum_{i=1}^n \log(A^T\nu)_i + n$$

(see example 10.2, page 525). Here the Newton step is obtained from solving

$$A\operatorname{diag}(y)^2A^T\Delta\nu_{\rm nt} = -b + Ay \qquad (10.35)$$

where $y = (1/(A^T\nu)_1, \ldots, 1/(A^T\nu)_n)$. Comparing (10.35) and (10.34) we see that both methods have the same complexity. In figure 10.7 we show the error for four different starting points. We use a backtracking line search with $\alpha = 0.1$, $\beta = 0.5$.

The third method is the infeasible start Newton method of §10.3, applied to the optimality conditions

$$\nabla f(x^\star) + A^T\nu^\star = 0, \qquad Ax^\star = b.$$

The Newton step is obtained by solving

$$\begin{bmatrix} H & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta x_{\rm nt} \\ \Delta\nu_{\rm nt} \end{bmatrix} = -\begin{bmatrix} g + A^T\nu \\ Ax - b \end{bmatrix},$$

where $H = \operatorname{diag}(1/x_1^2, \ldots, 1/x_n^2)$, and $g = -(1/x_1, \ldots, 1/x_n)$. This KKT system can be efficiently solved by elimination, at the same cost as (10.34) or (10.35). For example, if we first solve

$$A\operatorname{diag}(x)^2A^Tw = 2Ax - b,$$

then $\Delta\nu_{\rm nt}$ and $\Delta x_{\rm nt}$ follow from

$$\Delta\nu_{\rm nt} = w - \nu, \qquad \Delta x_{\rm nt} = x - \operatorname{diag}(x)^2A^Tw.$$

Figure 10.8 shows the norm of the residual

$$r(x, \nu) = (\nabla f(x) + A^T\nu,\ Ax - b)$$

versus iteration, for four different starting points. We use a backtracking line search with $\alpha = 0.1$, $\beta = 0.5$.

Figure 10.6 Error $f(x^{(k)}) - p^\star$ in Newton's method, applied to an equality constrained analytic centering problem of size $p = 100$, $n = 500$. The different curves correspond to four different starting points. Final quadratic convergence is clearly evident.

Figure 10.7 Error $|g(\nu^{(k)}) - p^\star|$ in Newton's method, applied to the dual of the equality constrained analytic centering problem.

Figure 10.8 Residual $\|r(x^{(k)}, \nu^{(k)})\|_2$ in the infeasible start Newton method, applied to the equality constrained analytic centering problem.

The figures show that for this problem, the dual method appears to be faster, but only by a factor of two or three. It takes about six iterations to reach the region of quadratic convergence, as opposed to 12-15 in the primal method and 10-20 in the infeasible start Newton method.

The methods also differ in the initialization they require. The primal method requires knowledge of a primal feasible point, i.e., satisfying $Ax^{(0)} = b$, $x^{(0)} \succ 0$. The dual method requires a dual feasible point, i.e., $A^T\nu^{(0)} \succ 0$. Depending on the problem, one or the other might be more readily available. The infeasible start Newton method requires no initialization; the only requirement is that $x^{(0)} \succ 0$.
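The infeasible start Newton method for analytic centering, with the step computed by elimination as described above, can be sketched as follows. The function name, the small positive random matrix $A$ (chosen so that the feasible set is bounded), the starting point $x^{(0)} = \mathbf{1}$, $\nu^{(0)} = 0$, and the tolerance are assumptions for this sketch; it is not the code behind figure 10.8.

```python
import numpy as np

def analytic_center_infeasible_start(A, b, alpha=0.1, beta=0.5, tol=1e-10, max_iter=100):
    p, n = A.shape
    x, nu = np.ones(n), np.zeros(p)        # any x > 0 is allowed; Ax = b is not required

    def residual(x, nu):
        # r(x, nu) = (grad f(x) + A^T nu, Ax - b) with grad f(x) = -1/x
        return np.concatenate([-1.0 / x + A.T @ nu, A @ x - b])

    norms = []
    for _ in range(max_iter):
        norm_r = np.linalg.norm(residual(x, nu))
        norms.append(norm_r)
        if norm_r <= tol:
            break
        # Elimination: solve A diag(x)^2 A^T w = 2Ax - b, then recover the steps.
        w = np.linalg.solve((A * x**2) @ A.T, 2.0 * (A @ x) - b)
        dnu = w - nu
        dx = x - x**2 * (A.T @ w)
        t = 1.0
        while t > 1e-12:                   # backtracking on the residual norm
            x_new, nu_new = x + t * dx, nu + t * dnu
            if x_new.min() > 0 and \
               np.linalg.norm(residual(x_new, nu_new)) <= (1.0 - alpha * t) * norm_r:
                break
            t *= beta
        x, nu = x_new, nu_new
    return x, nu, norms

rng = np.random.default_rng(1)             # illustrative feasible, bounded instance
A = rng.uniform(0.1, 1.0, (3, 8))
b = A @ rng.uniform(0.5, 1.5, 8)
x, nu, norms = analytic_center_infeasible_start(A, b)
```

Only a $p \times p$ system is solved per iteration, which reflects the cost comparison with (10.34) and (10.35).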

Optimal  network  flow 

We consider a connected directed graph or network with $n$ edges and $p + 1$ nodes. We let $x_j$ denote the flow or traffic on arc $j$, with $x_j > 0$ meaning flow in the direction of the arc, and $x_j < 0$ meaning flow in the direction opposite the arc. There is also a given external source (or sink) flow $s_i$ that enters (if $s_i > 0$) or leaves (if $s_i < 0$) node $i$. The flow must satisfy a conservation equation, which states that at each node, the total flow entering the node, including the external sources and sinks, is zero. This conservation equation can be expressed as $\tilde Ax = s$ where $\tilde A \in \mathbf{R}^{(p+1) \times n}$ is the node incidence matrix of the graph,

$$\tilde A_{ij} = \begin{cases} 1 & \text{arc } j \text{ leaves node } i \\ -1 & \text{arc } j \text{ enters node } i \\ 0 & \text{otherwise.} \end{cases}$$

The flow conservation equation $\tilde Ax = s$ is inconsistent unless $\mathbf{1}^Ts = 0$, which we assume is the case. (In other words, the total of the source flows must equal the total of the sink flows.) The flow conservation equations $\tilde Ax = s$ are also redundant, since $\mathbf{1}^T\tilde A = 0$. To obtain an independent set of equations we can delete any one equation, to obtain $Ax = b$, where $A \in \mathbf{R}^{p \times n}$ is the reduced node incidence matrix of the graph (i.e., the node incidence matrix with one row removed) and $b \in \mathbf{R}^p$ is the reduced source vector (i.e., $s$ with the associated entry removed).

In summary, flow conservation is given by $Ax = b$, where $A$ is the reduced node incidence matrix of the graph and $b$ is the reduced source vector. The matrix $A$ is very sparse, since each column has at most two nonzero entries (which can only be $+1$ or $-1$).

We will take traffic flows $x$ as the variables, and the sources as given. We introduce the objective function

$$f(x) = \sum_{i=1}^n \phi_i(x_i),$$

where $\phi_i : \mathbf{R} \to \mathbf{R}$ is the flow cost function for arc $i$. We assume that the flow cost functions are strictly convex and twice differentiable.

The problem of choosing the best flow, that satisfies the flow conservation requirement, is

$$\begin{array}{ll} \text{minimize} & \sum_{i=1}^n \phi_i(x_i) \\ \text{subject to} & Ax = b. \end{array} \qquad (10.36)$$

Here the Hessian $H$ is diagonal, since the objective is separable.

We have several choices for computing the Newton step for the optimal network flow problem (10.36). The most straightforward is to solve the full KKT system, using a sparse $LDL^T$ factorization.

For this problem it is probably better to compute the Newton step using block elimination. We can characterize the sparsity pattern of the Schur complement $S = -AH^{-1}A^T$ in terms of the graph: We have $S_{ij} \ne 0$ if and only if node $i$ and node $j$ are connected by an arc. It follows that if the network is sparse, i.e., if each node is connected by an arc to only a few other nodes, then the Schur complement $S$ is sparse. In this case, we can exploit sparsity in forming $S$, and in the associated factorization and solve steps, as well. We can expect the computational complexity of computing the Newton step to grow approximately linearly with the number of arcs (which is the number of variables).
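The sparsity claim for the Schur complement can be checked on a toy graph. In the following Python sketch the ring network with five nodes, the helper name `reduced_incidence`, and the diagonal Hessian entries are all made up for illustration.

```python
import numpy as np

def reduced_incidence(num_nodes, arcs):
    """Node incidence matrix of a directed graph with its last row removed.

    arcs is a list of (tail, head) node pairs; arc j leaves tail and enters head.
    """
    A_full = np.zeros((num_nodes, len(arcs)))
    for j, (tail, head) in enumerate(arcs):
        A_full[tail, j] = 1.0              # arc j leaves node tail
        A_full[head, j] = -1.0             # arc j enters node head
    return A_full[:-1, :]

# A ring on 5 nodes: 0 -> 1 -> 2 -> 3 -> 4 -> 0 (illustrative)
arcs = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
A = reduced_incidence(5, arcs)

h_diag = np.array([1.0, 2.0, 0.5, 4.0, 3.0])   # diagonal of H (assumed second derivatives > 0)
S = -(A / h_diag) @ A.T                        # Schur complement -A H^{-1} A^T

# Off-diagonal nonzeros of S mark pairs of retained nodes joined by an arc.
pattern = S != 0
```

For this ring, the retained nodes 0 to 3 form a path, so `pattern` is tridiagonal: nodes 0 and 3 are linked only through the removed node 4.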
Optimal control

We consider the problem

$$\begin{array}{ll} \text{minimize} & \sum_{t=1}^N \phi_t(z(t)) + \sum_{t=0}^{N-1} \psi_t(u(t)) \\ \text{subject to} & z(t+1) = A_tz(t) + B_tu(t), \quad t = 0, \ldots, N-1. \end{array}$$

Here

• $z(t) \in \mathbf{R}^k$ is the system state at time $t$

• $u(t) \in \mathbf{R}^l$ is the input or control action at time $t$

• $\phi_t : \mathbf{R}^k \to \mathbf{R}$ is the state cost function

• $\psi_t : \mathbf{R}^l \to \mathbf{R}$ is the input cost function

• $N$ is called the time horizon for the problem.

We assume that the input and state cost functions are strictly convex and twice differentiable. The variables in the problem are $u(0), \ldots, u(N-1)$, and $z(1), \ldots, z(N)$. The initial state $z(0)$ is given. The linear equality constraints are called the state equations or dynamic evolution equations. We define the overall optimization variable $x$ as

$$x = (u(0), z(1), u(1), \ldots, u(N-1), z(N)) \in \mathbf{R}^{N(k+l)}.$$

Since the objective is block separable (i.e., a sum of functions of $z(t)$ and $u(t)$), the Hessian is block diagonal:

$$H = \operatorname{diag}(R_0, Q_1, \ldots, R_{N-1}, Q_N),$$

where

$$R_t = \nabla^2\psi_t(u(t)), \quad t = 0, \ldots, N-1, \qquad Q_t = \nabla^2\phi_t(z(t)), \quad t = 1, \ldots, N.$$

We can collect all the equality constraints (i.e., the state equations) and express them as $Ax = b$ where

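$$A = \begin{bmatrix}
-B_0 & I & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\
0 & -A_1 & -B_1 & I & 0 & \cdots & 0 & 0 & 0 \\
0 & 0 & 0 & -A_2 & -B_2 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & 0 & \cdots & I & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & \cdots & -A_{N-1} & -B_{N-1} & I
\end{bmatrix}, \qquad
b = \begin{bmatrix} A_0z(0) \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \end{bmatrix}.$$

The number of rows of $A$ (i.e., equality constraints) is $Nk$.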

Directly solving the KKT system for the Newton step, using a dense $LDL^T$ factorization, would cost

$$(1/3)(2Nk + Nl)^3 = (1/3)N^3(2k + l)^3$$

flops. Using a sparse $LDL^T$ factorization would give a large improvement, since the method would exploit the many zero entries in $A$ and $H$.

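In fact we can do better by exploiting the special block structure of $H$ and $A$, using block elimination to compute the Newton step. The Schur complement $S = -AH^{-1}A^T$ turns out to be block tridiagonal, with $k \times k$ blocks:

$$S = -AH^{-1}A^T = \begin{bmatrix}
S_{11} & Q_1^{-1}A_1^T & 0 & \cdots & 0 & 0 \\
A_1Q_1^{-1} & S_{22} & Q_2^{-1}A_2^T & \cdots & 0 & 0 \\
0 & A_2Q_2^{-1} & S_{33} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & S_{N-1,N-1} & Q_{N-1}^{-1}A_{N-1}^T \\
0 & 0 & 0 & \cdots & A_{N-1}Q_{N-1}^{-1} & S_{NN}
\end{bmatrix}$$

where

$$\begin{aligned}
S_{11} &= -B_0R_0^{-1}B_0^T - Q_1^{-1}, \\
S_{ii} &= -A_{i-1}Q_{i-1}^{-1}A_{i-1}^T - B_{i-1}R_{i-1}^{-1}B_{i-1}^T - Q_i^{-1}, \quad i = 2, \ldots, N.
\end{aligned}$$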

In particular, $S$ is banded, with bandwidth $2k - 1$, so we can factor it in order $k^3N$ flops. Therefore we can compute the Newton step in order $k^3N$ flops, assuming $k \ll N$. Note that this grows linearly with the time horizon $N$, whereas for a generic method, the flop count grows like $N^3$.

For this problem we could go one step further and exploit the block tridiagonal structure of $S$. Applying a standard block tridiagonal factorization method would result in the classic Riccati recursion for solving a quadratic optimal control problem. Still, using only the banded nature of $S$ yields an algorithm that is the same order.
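The block tridiagonal structure of the Schur complement can be verified numerically on a small instance. The sketch below assembles $A$ and $H$ densely for clarity only; the function name, the dimensions $N = 4$, $k = 2$, $l = 1$, and the random data are assumptions, and an efficient implementation would never form these dense matrices.

```python
import numpy as np

def build_control_kkt_blocks(As, Bs, Rs, Qs):
    """Assemble A and H for x = (u(0), z(1), ..., u(N-1), z(N)).

    As[t], Bs[t], Rs[t] are given for t = 0..N-1; Qs[t] holds Q_{t+1}.
    """
    N = len(Bs)
    k, l = Bs[0].shape
    A = np.zeros((N * k, N * (k + l)))
    H = np.zeros((N * (k + l), N * (k + l)))
    for t in range(N):
        rows = slice(t * k, (t + 1) * k)
        u_cols = slice(t * (k + l), t * (k + l) + l)            # columns of u(t)
        z_cols = slice(t * (k + l) + l, (t + 1) * (k + l))      # columns of z(t+1)
        A[rows, u_cols] = -Bs[t]
        A[rows, z_cols] = np.eye(k)
        if t > 0:
            prev_z = slice((t - 1) * (k + l) + l, t * (k + l))  # columns of z(t)
            A[rows, prev_z] = -As[t]
        H[u_cols, u_cols] = Rs[t]
        H[z_cols, z_cols] = Qs[t]
    return A, H

rng = np.random.default_rng(0)             # illustrative data
N, k, l = 4, 2, 1

def random_spd(m):
    M = rng.standard_normal((m, m))
    return M @ M.T + m * np.eye(m)

As = [rng.standard_normal((k, k)) for _ in range(N)]
Bs = [rng.standard_normal((k, l)) for _ in range(N)]
Rs = [random_spd(l) for _ in range(N)]
Qs = [random_spd(k) for _ in range(N)]

A, H = build_control_kkt_blocks(As, Bs, Rs, Qs)
S = -A @ np.linalg.solve(H, A.T)           # Schur complement, N*k by N*k
```

Inspecting the $k \times k$ blocks of `S` shows zeros outside the three central block diagonals, and the leading block matches the expression for $S_{11}$.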


Analytic  center  of  a  linear  matrix  inequality 

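We consider the problem

$$\begin{array}{ll} \text{minimize} & f(X) = -\log\det X \\ \text{subject to} & \operatorname{tr}(A_iX) = b_i, \quad i = 1, \ldots, p, \end{array} \qquad (10.37)$$

where $X \in \mathbf{S}^n$ is the variable, $A_i \in \mathbf{S}^n$, $b_i \in \mathbf{R}$, and $\operatorname{dom} f = \mathbf{S}^n_{++}$. The KKT conditions for this problem are

$$-X^{\star -1} + \sum_{i=1}^p \nu_i^\star A_i = 0, \qquad \operatorname{tr}(A_iX^\star) = b_i, \quad i = 1, \ldots, p. \qquad (10.38)$$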

The dimension of the variable $X$ is $n(n+1)/2$. We could simply ignore the special matrix structure of $X$, and consider it as (vector) variable $x \in \mathbf{R}^{n(n+1)/2}$, and solve the problem (10.37) using a generic method for a problem with $n(n+1)/2$ variables and $p$ equality constraints. The cost for computing a Newton step would then be at least

$$(1/3)\bigl(n(n+1)/2 + p\bigr)^3$$

flops, which is order $n^6$ in $n$. We will see that there are a number of far more attractive alternatives.

A first option is to solve the dual problem. The conjugate of $f$ is

$$f^*(Y) = \log\det(-Y)^{-1} - n$$

with $\operatorname{dom} f^* = -\mathbf{S}^n_{++}$ (see example 3.23, page 92), so the dual problem is

$$\text{maximize} \quad -b^T\nu + \log\det\Bigl(\sum_{i=1}^p \nu_iA_i\Bigr) + n, \qquad (10.39)$$

with domain $\{\nu \mid \sum_{i=1}^p \nu_iA_i \succ 0\}$. This is an unconstrained problem with variable $\nu \in \mathbf{R}^p$. The optimal $X^\star$ can be recovered from the optimal $\nu^\star$ by solving the first (dual feasibility) equation in (10.38), i.e., $X^\star = \bigl(\sum_{i=1}^p \nu_i^\star A_i\bigr)^{-1}$.

Let us work out the cost of computing the Newton step for the dual problem (10.39).
We have to form the gradient and Hessian of $g$, and then solve for the
Newton step. The gradient and Hessian are given by

\[ \nabla^2 g(\nu)_{ij} = -\operatorname{tr}(A^{-1} A_i A^{-1} A_j), \quad i, j = 1, \ldots, p, \]

\[ \nabla g(\nu)_i = \operatorname{tr}(A^{-1} A_i) - b_i, \quad i = 1, \ldots, p, \]

where $A = \sum_{i=1}^p \nu_i A_i$. To form $\nabla^2 g(\nu)$ and $\nabla g(\nu)$ we proceed as follows. We
first form $A$ ($pn^2$ flops), and $A^{-1}A_j$ for each $j$ ($2pn^3$ flops). Then we form the
matrix $\nabla^2 g(\nu)$. Each of the $p(p+1)/2$ entries of $\nabla^2 g(\nu)$ is the inner product of
two matrices in $\mathbf{S}^n$, each of which costs $n(n+1)$ flops, so the total is (dropping
dominated terms) $(1/2)p^2n^2$ flops. Forming $\nabla g(\nu)$ is cheap since we already have
the matrices $A^{-1}A_i$. Finally, we solve for the Newton step $-\nabla^2 g(\nu)^{-1}\nabla g(\nu)$, which
costs $(1/3)p^3$ flops. All together, and keeping only the leading terms, the total cost
of computing the Newton step is $2pn^3 + (1/2)p^2n^2 + (1/3)p^3$. Note that this is
order $n^3$ in $n$, which is far better than the simple primal method described above,
which is order $n^6$.
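
The gradient and Hessian formulas for the dual problem (10.39) translate directly into a few lines of NumPy. The following is a minimal sketch under stated assumptions, not a tuned implementation: the function name `dual_newton_step` and the small data set are invented for illustration, and generic dense solves are used where a careful code would reuse a Cholesky factorization of $A$.

```python
import numpy as np

def dual_newton_step(nu, A_list, b):
    """Newton step for g(nu) = -b^T nu + log det(sum_i nu_i A_i) + n (hypothetical helper)."""
    A = sum(v * Ai for v, Ai in zip(nu, A_list))      # A = sum_i nu_i A_i, assumed positive definite
    M = [np.linalg.solve(A, Ai) for Ai in A_list]     # the products A^{-1} A_i
    grad = np.array([np.trace(Mi) for Mi in M]) - b   # grad_i = tr(A^{-1} A_i) - b_i
    p = len(A_list)
    # hess_ij = -tr(A^{-1} A_i A^{-1} A_j), evaluated as an inner product (order n^2 per entry)
    hess = -np.array([[np.sum(M[i] * M[j].T) for j in range(p)] for i in range(p)])
    step = -np.linalg.solve(hess, grad)
    return step, grad, hess

# Illustrative data (assumed, not from the text): n = 3, p = 2.
A_list = [np.eye(3), np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])]
b = np.array([1.0, 0.2])
nu = np.array([2.0, 0.5])
step, grad, hess = dual_newton_step(nu, A_list, b)
```

The Hessian entries are computed as matrix inner products rather than as traces of explicit products, which is what gives the $p^2n^2$ term in the flop count instead of a $p^2n^3$ term.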

We can also solve the primal problem more efficiently, by exploiting its special
matrix structure. To derive the KKT system for the Newton step $\Delta X_{\rm nt}$ at a feasible
$X$, we replace $X^\star$ in the KKT conditions by $X + \Delta X_{\rm nt}$ and $\nu^\star$ by $w$, and linearize
the first equation using the first-order approximation

\[ (X + \Delta X_{\rm nt})^{-1} \approx X^{-1} - X^{-1} \Delta X_{\rm nt} X^{-1}. \]

This gives the KKT system

\[ -X^{-1} + X^{-1} \Delta X_{\rm nt} X^{-1} + \sum_{i=1}^p w_i A_i = 0, \qquad \operatorname{tr}(A_i \Delta X_{\rm nt}) = 0, \quad i = 1, \ldots, p. \qquad (10.40) \]

This is a set of $n(n+1)/2 + p$ linear equations in the variables $\Delta X_{\rm nt} \in \mathbf{S}^n$ and
$w \in \mathbf{R}^p$. If we solved these equations using a generic method, the cost would be
order $n^6$.
We can use block elimination to solve the KKT system (10.40) far more efficiently.
We eliminate the variable $\Delta X_{\rm nt}$, by solving the first equation to get

\[ \Delta X_{\rm nt} = X - X \Bigl( \sum_{i=1}^p w_i A_i \Bigr) X = X - \sum_{i=1}^p w_i X A_i X. \qquad (10.41) \]

Substituting this expression for $\Delta X_{\rm nt}$ into the other equation gives

\[ \operatorname{tr}(A_j \Delta X_{\rm nt}) = \operatorname{tr}(A_j X) - \sum_{i=1}^p w_i \operatorname{tr}(A_j X A_i X) = 0, \quad j = 1, \ldots, p. \]

This is a set of $p$ linear equations in $w$:

\[ Cw = d \]

where $C_{ij} = \operatorname{tr}(A_i X A_j X)$, $d_i = \operatorname{tr}(A_i X)$. The coefficient matrix $C$ is symmetric
and positive definite, so a Cholesky factorization can be used to find $w$. Once we
have $w$, we can compute $\Delta X_{\rm nt}$ from (10.41).

The cost of this method is as follows. We form the products $A_i X$ ($2pn^3$ flops),
and then form the matrix $C$. Each of the $p(p+1)/2$ entries of $C$ is the inner
product of two matrices in $\mathbf{R}^{n \times n}$, so forming $C$ costs $p^2n^2$ flops. Then we solve
for $w = C^{-1}d$, which costs $(1/3)p^3$. Finally we compute $\Delta X_{\rm nt}$. If we use the
first expression in (10.41), i.e., first compute the sum and then pre- and post-multiply
with $X$, the cost is approximately $pn^2 + 3n^3$. All together, the total cost
is $2pn^3 + p^2n^2 + (1/3)p^3$ flops to form the Newton step for the primal problem,
using block elimination. This is far better than the simple method, which is order
$n^6$. Note also that the cost has the same order as that of computing the Newton step for
the dual problem.
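
The block elimination steps above can be sketched in NumPy as follows. This is an illustrative sketch only: the name `primal_newton_step` and the example matrices are assumptions, and `np.linalg.solve` is applied to the Cholesky factor for brevity (it does not exploit the triangular structure).

```python
import numpy as np

def primal_newton_step(X, A_list):
    """Solve the KKT system (10.40) by block elimination (illustrative helper name)."""
    AX = [Ai @ X for Ai in A_list]                    # products A_i X
    p = len(A_list)
    # C_ij = tr(A_i X A_j X), evaluated as an inner product of A_i X and (A_j X)^T
    C = np.array([[np.sum(AX[i] * AX[j].T) for j in range(p)] for i in range(p)])
    d = np.array([np.trace(M) for M in AX])           # d_i = tr(A_i X)
    L = np.linalg.cholesky(C)                         # C = L L^T
    w = np.linalg.solve(L.T, np.linalg.solve(L, d))   # two solves with the Cholesky factor
    S = sum(wi * Ai for wi, Ai in zip(w, A_list))
    dX = X - X @ S @ X                                # first expression in (10.41)
    return dX, w

# Illustrative data (assumed, not from the text): n = 3, p = 2, X positive definite.
A_list = [np.eye(3), np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])]
X = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]])
dX, w = primal_newton_step(X, A_list)
```

Only a $p \times p$ system is factored; the $n(n+1)/2 + p$ dimensional KKT system is never formed.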
Exercises

Equality  constrained  minimization 

10.1 Nonsingularity of the KKT matrix. Consider the KKT matrix

\[ \begin{bmatrix} P & A^T \\ A & 0 \end{bmatrix} \]

where $P \in \mathbf{S}^n_+$, $A \in \mathbf{R}^{p \times n}$, and $\operatorname{rank} A = p < n$.

(a) Show that each of the following statements is equivalent to nonsingularity of the
KKT matrix.

• $\mathcal{N}(P) \cap \mathcal{N}(A) = \{0\}$.

• $Ax = 0$, $x \neq 0 \implies x^T P x > 0$.

• $F^T P F \succ 0$, where $F \in \mathbf{R}^{n \times (n-p)}$ is a matrix for which $\mathcal{R}(F) = \mathcal{N}(A)$.

• $P + A^T Q A \succ 0$ for some $Q \succeq 0$.

(b) Show that if the KKT matrix is nonsingular, then it has exactly $n$ positive and $p$
negative eigenvalues.

10.2 Projected gradient method. In this problem we explore an extension of the gradient method
to equality constrained minimization problems. Suppose $f$ is convex and differentiable,
and $x \in \operatorname{dom} f$ satisfies $Ax = b$, where $A \in \mathbf{R}^{p \times n}$ with $\operatorname{rank} A = p < n$. The Euclidean
projection of the negative gradient $-\nabla f(x)$ on $\mathcal{N}(A)$ is given by

\[ \Delta x_{\rm pg} = \mathop{\rm argmin}_{Au = 0} \| -\nabla f(x) - u \|_2. \]

(a) Let $(v, w)$ be the unique solution of

\[ \begin{bmatrix} I & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} v \\ w \end{bmatrix} = \begin{bmatrix} -\nabla f(x) \\ 0 \end{bmatrix}. \]

Show that $v = \Delta x_{\rm pg}$ and $w = \mathop{\rm argmin}_y \| \nabla f(x) + A^T y \|_2$.

(b) What is the relation between the projected negative gradient $\Delta x_{\rm pg}$ and the negative
gradient of the reduced problem (10.5), assuming $F^T F = I$?

(c) The projected gradient method for solving an equality constrained minimization
problem uses the step $\Delta x_{\rm pg}$, and a backtracking line search on $f$. Use the results
of part (b) to give some conditions under which the projected gradient method
converges to the optimal solution, when started from a point $x^{(0)} \in \operatorname{dom} f$ with
$Ax^{(0)} = b$.

Newton’s  method  with  equality  constraints 

10.3 Dual Newton method. In this problem we explore Newton’s method for solving the dual
of the equality constrained minimization problem (10.1). We assume that $f$ is twice
differentiable, $\nabla^2 f(x) \succ 0$ for all $x \in \operatorname{dom} f$, and that for each $\nu \in \mathbf{R}^p$, the Lagrangian
$L(x, \nu) = f(x) + \nu^T(Ax - b)$ has a unique minimizer, which we denote $x(\nu)$.

(a) Show that the dual function $g$ is twice differentiable. Find an expression for the
Newton step for the dual function $g$, evaluated at $\nu$, in terms of $f$, $\nabla f$, and $\nabla^2 f$,
evaluated at $x = x(\nu)$. You can use the results of exercise 3.40.

(b) Suppose there exists a $K$ such that

\[ \left\| \begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix}^{-1} \right\|_2 \leq K \]

for all $x \in \operatorname{dom} f$. Show that $g$ is strongly concave, with $\nabla^2 g(\nu) \preceq -(1/K) I$.

10.4 Strong convexity and Lipschitz constant of the reduced problem. Suppose $f$ satisfies the
assumptions given on page 529. Show that the reduced objective function $\tilde f(z) = f(Fz + \hat x)$
is strongly convex, and that its Hessian is Lipschitz continuous (on the associated sublevel
set $\tilde S$). Express the strong convexity and Lipschitz constants of $\tilde f$ in terms of $K$, $M$, $L$,
and the maximum and minimum singular values of $F$.

10.5 Adding a quadratic term to the objective. Suppose $Q \succeq 0$. The problem

\[ \begin{array}{ll} \mbox{minimize} & f(x) + (Ax - b)^T Q (Ax - b) \\ \mbox{subject to} & Ax = b \end{array} \]

is equivalent to the original equality constrained optimization problem (10.1). Is the
Newton step for this problem the same as the Newton step for the original problem?

10.6 The Newton decrement. Show that (10.13) holds, i.e.,

\[ f(x) - \inf\{ \hat f(x + v) \mid A(x + v) = b \} = \lambda(x)^2 / 2. \]


Infeasible  start  Newton  method 


10.7 Assumptions for infeasible start Newton method. Consider the set of assumptions given
on page 536.

(a) Suppose that the function $f$ is closed. Show that this implies that the norm of the
residual, $\| r(x, \nu) \|_2$, is closed.

(b) Show that $Dr$ satisfies a Lipschitz condition if and only if $\nabla^2 f$ does.

10.8 Infeasible start Newton method and initially satisfied equality constraints. Suppose we use
the infeasible start Newton method to minimize $f(x)$ subject to $a_i^T x = b_i$, $i = 1, \ldots, p$.

(a) Suppose the initial point $x^{(0)}$ satisfies the linear equality $a_i^T x = b_i$. Show that the
linear equality will remain satisfied for future iterates, i.e., $a_i^T x^{(k)} = b_i$ for all $k$.

(b) Suppose that one of the equality constraints becomes satisfied at iteration $k$, i.e.,
we have $a_i^T x^{(k-1)} \neq b_i$, $a_i^T x^{(k)} = b_i$. Show that at iteration $k$, all the equality
constraints are satisfied.

10.9 Equality constrained entropy maximization. Consider the equality constrained entropy
maximization problem

\[ \begin{array}{ll} \mbox{minimize} & f(x) = \sum_{i=1}^n x_i \log x_i \\ \mbox{subject to} & Ax = b, \end{array} \qquad (10.42) \]

with $\operatorname{dom} f = \mathbf{R}^n_{++}$ and $A \in \mathbf{R}^{p \times n}$. We assume the problem is feasible and that
$\operatorname{rank} A = p < n$.

(a) Show that the problem has a unique optimal solution $x^\star$.

(b) Find $A$, $b$, and feasible $x^{(0)}$ for which the sublevel set

\[ \{ x \in \mathbf{R}^n_{++} \mid Ax = b, \; f(x) \leq f(x^{(0)}) \} \]

is not closed. Thus, the assumptions listed in §10.2.4, page 529, are not satisfied for
some feasible initial points.

(c) Show that the problem (10.42) satisfies the assumptions for the infeasible start
Newton method listed in §10.3.3, page 536, for any feasible starting point.

(d) Derive the Lagrange dual of (10.42), and explain how to find the optimal solution
of (10.42) from the optimal solution of the dual problem. Show that the dual problem
satisfies the assumptions listed in §10.2.4, page 529, for any starting point.

The results of parts (b), (c), and (d) do not mean the standard Newton method will fail,
or that the infeasible start Newton method or dual method will work better in practice.
They only mean that our convergence analysis for the standard Newton method does not apply,
while our convergence analysis does apply to the infeasible start and dual methods. (See
exercise 10.15.)

10.10 Bounded inverse derivative condition for strongly convex-concave game. Consider a
convex-concave game with payoff function $f$ (see page 541). Suppose $\nabla^2_{uu} f(u, v) \succeq mI$ and
$\nabla^2_{vv} f(u, v) \preceq -mI$, for all $(u, v) \in \operatorname{dom} f$. Show that

\[ \| Dr(u, v)^{-1} \|_2 = \| \nabla^2 f(u, v)^{-1} \|_2 \leq 1/m. \]

Implementation 

10.11 Consider the resource allocation problem described in example 10.1. You can assume the
$f_i$ are strongly convex, i.e., $f_i''(z) \geq m > 0$ for all $z$.

(a) Find the computational effort required to compute a Newton step for the reduced
problem. Be sure to exploit the special structure of the Newton equations.

(b) Explain how to solve the problem via the dual. You can assume that the conjugate
functions $f_i^*$, and their derivatives, are readily computable, and that the equation
$f_i'(x) = \nu$ is readily solved for $x$, given $\nu$. What is the computational complexity of
finding a Newton step for the dual problem?

(c)  What  is  the  computational  complexity  of  computing  a  Newton  step  for  the  resource 
allocation  problem?  Be  sure  to  exploit  the  special  structure  of  the  KKT  equations. 

10.12 Describe an efficient way to compute the Newton step for the problem

\[ \begin{array}{ll} \mbox{minimize} & \operatorname{tr}(X^{-1}) \\ \mbox{subject to} & \operatorname{tr}(A_i X) = b_i, \quad i = 1, \ldots, p \end{array} \]

with domain $\mathbf{S}^n_{++}$, assuming $p$ and $n$ have the same order of magnitude. Also derive the
Lagrange dual problem and give the complexity of finding the Newton step for the dual
problem.

10.13 Elimination method for computing Newton step for convex-concave game. Consider a
convex-concave game with payoff function $f : \mathbf{R}^p \times \mathbf{R}^q \to \mathbf{R}$ (see page 541). We assume
that $f$ is strongly convex-concave, i.e., for all $(u, v) \in \operatorname{dom} f$ and some $m > 0$, we have
$\nabla^2_{uu} f(u, v) \succeq mI$ and $\nabla^2_{vv} f(u, v) \preceq -mI$.

(a) Show how to compute the Newton step using Cholesky factorizations of $\nabla^2_{uu} f(u, v)$
and $-\nabla^2_{vv} f(u, v)$. Compare the cost of this method with the cost of using an $LDL^T$
factorization of $\nabla^2 f(u, v)$, assuming $\nabla^2 f(u, v)$ is dense.

(b) Show how you can exploit diagonal or block diagonal structure in $\nabla^2_{uu} f(u, v)$ and/or
$\nabla^2_{vv} f(u, v)$. How much do you save, if you assume $\nabla^2_{uv} f(u, v)$ is dense?

Numerical  experiments 

10.14 Log-optimal investment. Consider the log-optimal investment problem described in
exercise 4.60, without the constraint $x \succeq 0$. Use Newton’s method to compute the solution,
with the following problem data: there are $n = 3$ assets, and $m = 4$ scenarios, with
returns

\[ p_1 = \begin{bmatrix} 2 \\ 1.3 \\ 1 \end{bmatrix}, \quad p_2 = \begin{bmatrix} 2 \\ 0.5 \\ 1 \end{bmatrix}, \quad p_3 = \begin{bmatrix} 0.5 \\ 1.3 \\ 1 \end{bmatrix}, \quad p_4 = \begin{bmatrix} 0.5 \\ 0.5 \\ 1 \end{bmatrix}. \]

The probabilities of the four scenarios are given by $\pi = (1/3, 1/6, 1/3, 1/6)$.

10.15 Equality constrained entropy maximization. Consider the equality constrained entropy
maximization problem

\[ \begin{array}{ll} \mbox{minimize} & f(x) = \sum_{i=1}^n x_i \log x_i \\ \mbox{subject to} & Ax = b, \end{array} \]

with $\operatorname{dom} f = \mathbf{R}^n_{++}$ and $A \in \mathbf{R}^{p \times n}$, with $p < n$. (See exercise 10.9 for some relevant
analysis.)

Generate a problem instance with $n = 100$ and $p = 30$ by choosing $A$ randomly (checking
that it has full rank), choosing $\hat x$ as a random positive vector (e.g., with entries uniformly
distributed on $[0, 1]$) and then setting $b = A\hat x$. (Thus, $\hat x$ is feasible.)

Compute the solution of the problem using the following methods.

(a) Standard Newton method. You can use initial point $x^{(0)} = \hat x$.

(b) Infeasible start Newton method. You can use initial point $x^{(0)} = \hat x$ (to compare with
the standard Newton method), and also the initial point $x^{(0)} = \mathbf{1}$.

(c) Dual Newton method, i.e., the standard Newton method applied to the dual problem.

Verify  that  the  three  methods  compute  the  same  optimal  point  (and  Lagrange  multiplier). 
Compare  the  computational  effort  per  step  for  the  three  methods,  assuming  relevant 
structure  is  exploited.  (Your  implementation,  however,  does  not  need  to  exploit  structure 
to  compute  the  Newton  step.) 

10.16 Convex-concave game. Use the infeasible start Newton method to solve convex-concave
games of the form (10.32), with randomly generated data. Plot the norm of the residual
and step length versus iteration. Experiment with the line search parameters and initial
point (which must satisfy $\|u\|_2 < 1$, $\|v\|_2 < 1$, however).


Chapter  11 

Interior-point  methods 


11.1  Inequality  constrained  minimization  problems 

In  this  chapter  we  discuss  interior-point  methods  for  solving  convex  optimization 
problems  that  include  inequality  constraints, 

\[ \begin{array}{ll} \mbox{minimize} & f_0(x) \\ \mbox{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\ & Ax = b, \end{array} \qquad (11.1) \]

where $f_0, \ldots, f_m : \mathbf{R}^n \to \mathbf{R}$ are convex and twice continuously differentiable, and
$A \in \mathbf{R}^{p \times n}$ with $\operatorname{rank} A = p < n$. We assume that the problem is solvable, i.e., an
optimal $x^\star$ exists. We denote the optimal value $f_0(x^\star)$ as $p^\star$.

We also assume that the problem is strictly feasible, i.e., there exists $x \in \mathcal{D}$ that
satisfies $Ax = b$ and $f_i(x) < 0$ for $i = 1, \ldots, m$. This means that Slater’s constraint
qualification holds, so there exist dual optimal $\lambda^\star \in \mathbf{R}^m$, $\nu^\star \in \mathbf{R}^p$, which together
with $x^\star$ satisfy the KKT conditions

\[ \begin{array}{rcl} Ax^\star = b, \quad f_i(x^\star) & \leq & 0, \quad i = 1, \ldots, m \\ \lambda^\star & \succeq & 0 \\ \nabla f_0(x^\star) + \sum_{i=1}^m \lambda_i^\star \nabla f_i(x^\star) + A^T \nu^\star & = & 0 \\ \lambda_i^\star f_i(x^\star) & = & 0, \quad i = 1, \ldots, m. \end{array} \qquad (11.2) \]

Interior-point methods solve the problem (11.1) (or the KKT conditions (11.2))
by applying Newton’s method to a sequence of equality constrained problems, or
to a sequence of modified versions of the KKT conditions. We will concentrate on
a particular interior-point algorithm, the barrier method, for which we give a proof
of convergence and a complexity analysis. We also describe a simple primal-dual
interior-point method (in §11.7), but do not give an analysis.

We can view interior-point methods as another level in the hierarchy of convex
optimization algorithms. Linear equality constrained quadratic problems are the
simplest. For these problems the KKT conditions are a set of linear equations,
which can be solved analytically. Newton’s method is the next level in the hierarchy.
We can think of Newton’s method as a technique for solving a linear equality
constrained optimization problem, with twice differentiable objective, by reducing
it to a sequence of linear equality constrained quadratic problems. Interior-point
methods form the next level in the hierarchy: They solve an optimization problem
with linear equality and inequality constraints by reducing it to a sequence of linear
equality constrained problems.

Examples 

Many problems are already in the form (11.1), and satisfy the assumption that the
objective and constraint functions are twice differentiable. Obvious examples are
LPs, QPs, QCQPs, and GPs in convex form; another example is linear inequality
constrained entropy maximization,

\[ \begin{array}{ll} \mbox{minimize} & \sum_{i=1}^n x_i \log x_i \\ \mbox{subject to} & Fx \preceq g \\ & Ax = b, \end{array} \]

with domain $\mathcal{D} = \mathbf{R}^n_{++}$.

Many other problems do not have the required form (11.1), with twice differentiable
objective and constraint functions, but can be reformulated in the required
form. We have already seen many examples of this, such as the transformation of
an unconstrained convex piecewise-linear minimization problem

\[ \mbox{minimize} \quad \max_{i=1,\ldots,m} (a_i^T x + b_i) \]

(with nondifferentiable objective), to the LP

\[ \begin{array}{ll} \mbox{minimize} & t \\ \mbox{subject to} & a_i^T x + b_i \leq t, \quad i = 1, \ldots, m \end{array} \]

(which has twice differentiable objective and constraint functions).
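
As a small illustration of the piecewise-linear reformulation just described, the sketch below builds the data of the epigraph LP in the stacked variable $z = (x, t)$. The helper name `epigraph_lp` and the sample data are assumptions made for this example; no LP solver is called, the code only constructs the LP and evaluates it at one point.

```python
import numpy as np

def epigraph_lp(A, b):
    """Return (c, G, h) for: minimize c^T z subject to G z <= h, with z = (x, t).

    Encodes a_i^T x + b_i <= t for each row a_i^T of A (hypothetical helper name).
    """
    m, n = A.shape
    c = np.concatenate([np.zeros(n), [1.0]])   # the objective picks out t
    G = np.hstack([A, -np.ones((m, 1))])       # a_i^T x - t <= -b_i
    h = -b
    return c, G, h

# Illustrative data (assumed): three affine functions of a scalar x.
A = np.array([[1.0], [-1.0], [0.5]])
b = np.array([0.0, 0.0, 1.0])
c, G, h = epigraph_lp(A, b)

x = np.array([0.3])
t_min = np.max(A @ x + b)   # smallest t that is feasible for this x
```

For any fixed $x$, the smallest feasible $t$ equals the piecewise-linear objective at $x$, which is why the two problems have the same optimal value.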

Other  convex  optimization  problems,  such  as  SOCPs  and  SDPs,  are  not  readily 
recast  in  the  required  form,  but  can  be  handled  by  extensions  of  interior-point 
methods  to  problems  with  generalized  inequalities,  which  we  describe  in  §11.6. 


11.2  Logarithmic  barrier  function  and  central  path 


Our  goal  is  to  approximately  formulate  the  inequality  constrained  problem  (11.1) 
as  an  equality  constrained  problem  to  which  Newton’s  method  can  be  applied. 
Our  first  step  is  to  rewrite  the  problem  (11.1),  making  the  inequality  constraints 
implicit  in  the  objective: 


\[ \begin{array}{ll} \mbox{minimize} & f_0(x) + \sum_{i=1}^m I_-(f_i(x)) \\ \mbox{subject to} & Ax = b, \end{array} \qquad (11.3) \]

where $I_- : \mathbf{R} \to \mathbf{R}$ is the indicator function for the nonpositive reals,

\[ I_-(u) = \left\{ \begin{array}{ll} 0 & u \leq 0 \\ \infty & u > 0. \end{array} \right. \]

Figure 11.1 The dashed lines show the function $I_-(u)$, and the solid curves
show $\hat I_-(u) = -(1/t)\log(-u)$, for $t = 0.5, 1, 2$. The curve for $t = 2$ gives
the best approximation.

The problem (11.3) has no inequality constraints, but its objective function is not
(in general) differentiable, so Newton’s method cannot be applied.


11.2.1  Logarithmic  barrier 

The basic idea of the barrier method is to approximate the indicator function $I_-$
by the function

\[ \hat I_-(u) = -(1/t)\log(-u), \qquad \operatorname{dom} \hat I_- = -\mathbf{R}_{++}, \]

where $t > 0$ is a parameter that sets the accuracy of the approximation. Like
$I_-$, the function $\hat I_-$ is convex and nondecreasing, and (by our convention) takes
on the value $\infty$ for $u > 0$. Unlike $I_-$, however, $\hat I_-$ is differentiable and closed:
it increases to $\infty$ as $u$ increases to 0. Figure 11.1 shows the function $I_-$, and
the approximation $\hat I_-$, for several values of $t$. As $t$ increases, the approximation
becomes more accurate.
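
A quick numerical check makes the role of $t$ concrete. The sketch below (the function name `log_barrier_approx` is an assumption for illustration) evaluates the approximation at a fixed strictly negative point for the three values of $t$ used in figure 11.1.

```python
import math

def log_barrier_approx(u, t):
    """Approximation -(1/t) log(-u) of the indicator I_-(u); +inf outside u < 0."""
    if u >= 0:
        return math.inf
    return -(1.0 / t) * math.log(-u)

# Values at a fixed strictly negative point for the t values used in figure 11.1.
u = -0.5
values = {t: log_barrier_approx(u, t) for t in (0.5, 1.0, 2.0)}
```

At any fixed $u < 0$ the value scales as $1/t$, so it tends to $I_-(u) = 0$ as $t$ grows, while the value at $u \geq 0$ stays at $\infty$.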

Substituting $\hat I_-$ for $I_-$ in (11.3) gives the approximation

\[ \begin{array}{ll} \mbox{minimize} & f_0(x) + \sum_{i=1}^m -(1/t)\log(-f_i(x)) \\ \mbox{subject to} & Ax = b. \end{array} \qquad (11.4) \]

The objective here is convex, since $-(1/t)\log(-u)$ is convex and increasing in $u$,
and differentiable. Assuming an appropriate closedness condition holds, Newton’s
method can be used to solve it.

The function

\[ \phi(x) = -\sum_{i=1}^m \log(-f_i(x)), \qquad (11.5) \]

with $\operatorname{dom}\phi = \{x \in \mathbf{R}^n \mid f_i(x) < 0, \; i = 1, \ldots, m\}$, is called the logarithmic barrier
or log barrier for the problem (11.1). Its domain is the set of points that satisfy
the inequality constraints of (11.1) strictly. No matter what value the positive
parameter $t$ has, the logarithmic barrier grows without bound if $f_i(x) \to 0$, for
any $i$.

Of course, the problem (11.4) is only an approximation of the original problem (11.3),
so one question that arises immediately is how well a solution of (11.4)
approximates a solution of the original problem (11.3). Intuition suggests, and we
will soon confirm, that the quality of the approximation improves as the parameter
$t$ grows.

On the other hand, when the parameter $t$ is large, the function $f_0 + (1/t)\phi$ is
difficult to minimize by Newton’s method, since its Hessian varies rapidly near the
boundary of the feasible set. We will see that this problem can be circumvented
by solving a sequence of problems of the form (11.4), increasing the parameter $t$
(and therefore the accuracy of the approximation) at each step, and starting each
Newton minimization at the solution of the problem for the previous value of $t$.

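For future reference, we note that the gradient and Hessian of the logarithmic
barrier function $\phi$ are given by

\[ \nabla\phi(x) = \sum_{i=1}^m \frac{1}{-f_i(x)} \nabla f_i(x), \]

\[ \nabla^2\phi(x) = \sum_{i=1}^m \frac{1}{f_i(x)^2} \nabla f_i(x) \nabla f_i(x)^T + \sum_{i=1}^m \frac{1}{-f_i(x)} \nabla^2 f_i(x) \]

(see §A.4.2 and §A.4.4).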

11.2.2  Central  path 

We  now  consider  in  more  detail  the  minimization  problem  (11.4).  It  will  simplify 
notation  later  on  if  we  multiply  the  objective  by  t,  and  consider  the  equivalent 
problem 

\[ \begin{array}{ll} \mbox{minimize} & t f_0(x) + \phi(x) \\ \mbox{subject to} & Ax = b, \end{array} \qquad (11.6) \]

which has the same minimizers. We assume for now that the problem (11.6) can
be solved via Newton’s method, and, in particular, that it has a unique solution
for each $t > 0$. (We will discuss this assumption in more detail in §11.3.3.)

For $t > 0$ we define $x^\star(t)$ as the solution of (11.6). The central path associated
with problem (11.1) is defined as the set of points $x^\star(t)$, $t > 0$, which we call
the central points. Points on the central path are characterized by the following
necessary and sufficient conditions: $x^\star(t)$ is strictly feasible, i.e., satisfies

\[ Ax^\star(t) = b, \qquad f_i(x^\star(t)) < 0, \quad i = 1, \ldots, m, \]

and there exists a $\hat\nu \in \mathbf{R}^p$ such that

\[ 0 = t\nabla f_0(x^\star(t)) + \nabla\phi(x^\star(t)) + A^T\hat\nu
     = t\nabla f_0(x^\star(t)) + \sum_{i=1}^m \frac{1}{-f_i(x^\star(t))} \nabla f_i(x^\star(t)) + A^T\hat\nu \qquad (11.7) \]

holds.


Example 11.1 Inequality form linear programming. The logarithmic barrier function
for an LP in inequality form,

\[ \begin{array}{ll} \mbox{minimize} & c^T x \\ \mbox{subject to} & Ax \preceq b, \end{array} \qquad (11.8) \]

is given by

\[ \phi(x) = -\sum_{i=1}^m \log(b_i - a_i^T x), \qquad \operatorname{dom}\phi = \{x \mid Ax \prec b\}, \]

where $a_1^T, \ldots, a_m^T$ are the rows of $A$. The gradient and Hessian of the barrier function
are

\[ \nabla\phi(x) = \sum_{i=1}^m \frac{1}{b_i - a_i^T x}\, a_i, \qquad \nabla^2\phi(x) = \sum_{i=1}^m \frac{1}{(b_i - a_i^T x)^2}\, a_i a_i^T, \]

or, more compactly,

\[ \nabla\phi(x) = A^T d, \qquad \nabla^2\phi(x) = A^T \mathop{\bf diag}(d)^2 A, \]

where the elements of $d \in \mathbf{R}^m$ are given by $d_i = 1/(b_i - a_i^T x)$. Since $x$ is strictly
feasible, we have $d \succ 0$, so the Hessian of $\phi$ is nonsingular if and only if $A$ has rank $n$.

The centrality condition (11.7) is

\[ tc + \sum_{i=1}^m \frac{1}{b_i - a_i^T x}\, a_i = tc + A^T d = 0. \qquad (11.9) \]

We can give a simple geometric interpretation of the centrality condition. At a point
$x^\star(t)$ on the central path the gradient $\nabla\phi(x^\star(t))$, which is normal to the level set of $\phi$
through $x^\star(t)$, must be parallel to $-c$. In other words, the hyperplane $c^T x = c^T x^\star(t)$
is tangent to the level set of $\phi$ through $x^\star(t)$. Figure 11.2 shows an example with
$m = 6$ and $n = 2$.
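
The compact expressions $\nabla\phi(x) = A^T d$ and $\nabla^2\phi(x) = A^T \mathop{\bf diag}(d)^2 A$ from example 11.1 can be checked numerically. The following sketch assumes a small box-constrained instance; the function name `lp_barrier` and the data are illustrative choices, not taken from the text.

```python
import numpy as np

def lp_barrier(A, b, x):
    """Value, gradient and Hessian of phi(x) = -sum_i log(b_i - a_i^T x) at strictly feasible x."""
    s = b - A @ x                        # slacks b_i - a_i^T x, assumed positive
    d = 1.0 / s
    value = -np.sum(np.log(s))
    grad = A.T @ d                       # A^T d
    hess = A.T @ (d[:, None] ** 2 * A)   # A^T diag(d)^2 A without forming diag(d)
    return value, grad, hess

# Illustrative data (assumed): the unit box 0 <= x <= 1 in R^2, so m = 4 and n = 2.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])
x = np.array([0.25, 0.6])
value, grad, hess = lp_barrier(A, b, x)
```

Scaling the rows of $A$ by $d^2$ avoids building the $m \times m$ diagonal matrix, which matters when $m$ is large.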


Dual  points  from  central  path 

From (11.7) we can derive an important property of the central path: Every central
point yields a dual feasible point, and hence a lower bound on the optimal value
$p^\star$. More specifically, define

\[ \lambda_i^\star(t) = -\frac{1}{t f_i(x^\star(t))}, \quad i = 1, \ldots, m, \qquad \nu^\star(t) = \hat\nu / t. \qquad (11.10) \]

We claim that the pair $\lambda^\star(t)$, $\nu^\star(t)$ is dual feasible.

First, it is clear that $\lambda^\star(t) \succ 0$ because $f_i(x^\star(t)) < 0$, $i = 1, \ldots, m$. By
expressing the optimality conditions (11.7) as

\[ \nabla f_0(x^\star(t)) + \sum_{i=1}^m \lambda_i^\star(t) \nabla f_i(x^\star(t)) + A^T \nu^\star(t) = 0, \]

we see that $x^\star(t)$ minimizes the Lagrangian

\[ L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \nu^T(Ax - b), \]

for $\lambda = \lambda^\star(t)$ and $\nu = \nu^\star(t)$, which means that $\lambda^\star(t)$, $\nu^\star(t)$ is a dual feasible pair.

Figure 11.2 Central path for an LP with $n = 2$ and $m = 6$. The dashed
curves show three contour lines of the logarithmic barrier function $\phi$. The
central path converges to the optimal point $x^\star$ as $t \to \infty$. Also shown is the
point on the central path with $t = 10$. The optimality condition (11.9) at
this point can be verified geometrically: The line $c^T x = c^T x^\star(10)$ is tangent
to the contour line of $\phi$ through $x^\star(10)$.

Therefore the dual function $g(\lambda^\star(t), \nu^\star(t))$ is finite, and

\[ \begin{array}{rcl} g(\lambda^\star(t), \nu^\star(t)) & = & f_0(x^\star(t)) + \sum_{i=1}^m \lambda_i^\star(t) f_i(x^\star(t)) + \nu^\star(t)^T (Ax^\star(t) - b) \\ & = & f_0(x^\star(t)) - m/t. \end{array} \]

In particular, the duality gap associated with $x^\star(t)$ and the dual feasible pair $\lambda^\star(t)$,
$\nu^\star(t)$ is simply $m/t$. As an important consequence, we have

\[ f_0(x^\star(t)) - p^\star \leq m/t, \]

i.e., $x^\star(t)$ is no more than $m/t$-suboptimal. This confirms the intuitive idea that
$x^\star(t)$ converges to an optimal point as $t \to \infty$.


Example 11.2 Inequality form linear programming. The dual of the inequality form
LP (11.8) is

\[ \begin{array}{ll} \mbox{maximize} & -b^T \lambda \\ \mbox{subject to} & A^T \lambda + c = 0 \\ & \lambda \succeq 0. \end{array} \]

From the optimality conditions (11.9), it is clear that

\[ \lambda_i^\star(t) = \frac{1}{t\,(b_i - a_i^T x^\star(t))}, \quad i = 1, \ldots, m, \]

is dual feasible, with dual objective value

\[ -b^T \lambda^\star(t) = c^T x^\star(t) + (Ax^\star(t) - b)^T \lambda^\star(t) = c^T x^\star(t) - m/t. \]
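
The $m/t$ duality gap can be verified on a one-dimensional LP whose central path has a closed form. The sketch below assumes the problem: minimize $x$ subject to $-x \leq 0$, $x \leq 1$ (so $c = 1$, $b = (0, 1)$, $m = 2$, $p^\star = 0$); the function names are hypothetical.

```python
import math

def central_point(t):
    """Central point x*(t) of: minimize x subject to -x <= 0, x <= 1 (closed form)."""
    # Stationarity t - 1/x + 1/(1 - x) = 0 gives t x^2 - (t + 2) x + 1 = 0;
    # the root in (0, 1) is the smaller one.
    return ((t + 2.0) - math.sqrt(t * t + 4.0)) / (2.0 * t)

def dual_point(t):
    """Dual point lambda*(t) with entries 1 / (t (b_i - a_i^T x*(t)))."""
    x = central_point(t)
    return (1.0 / (t * x), 1.0 / (t * (1.0 - x)))

t, m = 10.0, 2
x = central_point(t)
lam1, lam2 = dual_point(t)
primal_value = x          # c^T x with c = 1
dual_value = -lam2        # -b^T lambda with b = (0, 1)
gap = primal_value - dual_value
```

Here `gap` agrees with $m/t$ up to rounding, and `primal_value` is within $m/t$ of the optimal value $0$, as the bound $f_0(x^\star(t)) - p^\star \leq m/t$ requires.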


Interpretation  via  KKT  conditions 

We can also interpret the central path conditions (11.7) as a continuous deformation
of the KKT optimality conditions (11.2). A point $x$ is equal to $x^\star(t)$ if and only if
there exist $\lambda$, $\nu$ such that

\[ \begin{array}{rcl} Ax = b, \quad f_i(x) & \leq & 0, \quad i = 1, \ldots, m \\ \lambda & \succeq & 0 \\ \nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^T \nu & = & 0 \\ -\lambda_i f_i(x) & = & 1/t, \quad i = 1, \ldots, m. \end{array} \qquad (11.11) \]

The only difference between the KKT conditions (11.2) and the centrality conditions
(11.11) is that the complementarity condition $-\lambda_i f_i(x) = 0$ is replaced by
the condition $-\lambda_i f_i(x) = 1/t$. In particular, for large $t$, $x^\star(t)$ and the associated
dual point $\lambda^\star(t)$, $\nu^\star(t)$ ‘almost’ satisfy the KKT optimality conditions for (11.1).

Force  field  interpretation 

We can give a simple mechanics interpretation of the central path in terms of
potential forces acting on a particle in the strictly feasible set $C$. For simplicity we
assume that there are no equality constraints.

We associate with each constraint the force

\[ F_i(x) = -\nabla\bigl(-\log(-f_i(x))\bigr) = \frac{1}{f_i(x)} \nabla f_i(x) \]

acting on the particle when it is at position $x$. The potential associated with the
total force field generated by the constraints is the logarithmic barrier $\phi$. As the
particle moves toward the boundary of the feasible set, it is strongly repelled by
the forces generated by the constraints.

Now we imagine another force acting on the particle, given by

\[ F_0(x) = -t\nabla f_0(x), \]

when the particle is at position $x$. This objective force field acts to pull the particle
in the negative gradient direction, i.e., toward smaller $f_0$. The parameter $t$ scales
the objective force, relative to the constraint forces.

The central point $x^\star(t)$ is the point where the constraint forces exactly balance
the objective force felt by the particle. As the parameter $t$ increases, the particle is
more strongly pulled toward the optimal point, but it is always trapped in $C$ by the
barrier potential, which becomes infinite as the particle approaches the boundary.


Example 11.3 Force field interpretation for inequality form LP. The force field associated
with the $i$th constraint of the LP (11.8) is

\[ F_i(x) = \frac{-a_i}{b_i - a_i^T x}. \]

This force is in the direction of the inward pointing normal to the constraint plane
$H_i = \{x \mid a_i^T x = b_i\}$, and has magnitude inversely proportional to the distance to
$H_i$, i.e.,

\[ \|F_i(x)\|_2 = \frac{\|a_i\|_2}{b_i - a_i^T x} = \frac{1}{\mathop{\bf dist}(x, H_i)}. \]

In other words, each constraint hyperplane has an associated repulsive force, given
by the inverse distance to the hyperplane.

The term $tc^T x$ is the potential associated with a constant force $-tc$ on the particle.
This ‘objective force’ pushes the particle in the direction of low cost. Thus, $x^\star(t)$
is the equilibrium position of the particle when it is subject to the inverse-distance
constraint forces, and the objective force $-tc$. When $t$ is very large, the particle is
pushed almost to the optimal point. The strong objective force is balanced by the
opposing constraint forces, which are large because we are near the feasible boundary.

Figure 11.3 illustrates this interpretation for a small LP with $n = 2$ and $m = 5$. The
lefthand plot shows $x^\star(t)$ for $t = 1$, as well as the constraint forces acting on it, which
balance the objective force. The righthand plot shows $x^\star(t)$ and the associated forces
for $t = 3$. The larger value of objective force moves the particle closer to the optimal
point.

Figure 11.3 Force field interpretation of central path. The central path is
shown as the dashed curve. The two points $x^\star(1)$ and $x^\star(3)$ are shown as
dots in the left and right plots, respectively. The objective force, which is
equal to $-c$ and $-3c$, respectively, is shown as a heavy arrow. The other
arrows represent the constraint forces, which are given by an inverse-distance
law. As the strength of the objective force varies, the equilibrium position
of the particle traces out the central path.
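
The force balance at a central point can be confirmed numerically on a one-dimensional LP. The sketch below assumes the problem: minimize $x$ subject to $-x \leq 0$, $x \leq 1$, with $t = 3$; the function names and the choice of problem are illustrative assumptions.

```python
import math

def constraint_forces(x):
    """Forces F_i(x) = -a_i / (b_i - a_i^T x) for the constraints -x <= 0 and x <= 1."""
    a, b = (-1.0, 1.0), (0.0, 1.0)
    return [-ai / (bi - ai * x) for ai, bi in zip(a, b)]

def objective_force(t, c=1.0):
    """Constant force -t c associated with the potential t c^T x."""
    return -t * c

t = 3.0
# Central point of this one-dimensional LP: root in (0, 1) of t x^2 - (t + 2) x + 1 = 0.
x_star = ((t + 2.0) - math.sqrt(t * t + 4.0)) / (2.0 * t)
net_force = sum(constraint_forces(x_star)) + objective_force(t)
```

The first constraint pushes the particle away from $x = 0$ with magnitude $1/x$, the second pushes it away from $x = 1$, and at $x^\star(t)$ their sum cancels the objective force, so `net_force` vanishes up to rounding.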


11.3  The  barrier  method 

We have seen that the point $x^\star(t)$ is $m/t$-suboptimal, and that a certificate of this
accuracy is provided by the dual feasible pair $\lambda^\star(t)$, $\nu^\star(t)$. This suggests a very
straightforward method for solving the original problem (11.1) with a guaranteed
specified accuracy $\epsilon$: We simply take $t = m/\epsilon$ and solve the equality constrained
problem

\[ \begin{array}{ll} \mbox{minimize} & (m/\epsilon) f_0(x) + \phi(x) \\ \mbox{subject to} & Ax = b \end{array} \]

using Newton’s method. This method could be called the unconstrained minimization
method, since it allows us to solve the inequality constrained problem (11.1) to
a guaranteed accuracy by solving an unconstrained, or linearly constrained, problem.
Although this method can work well for small problems, good starting points,
and moderate accuracy (i.e., $\epsilon$ not too small), it does not work well in other cases.
As a result it is rarely, if ever, used.


11.3.1  The  barrier  method 

A simple extension of the unconstrained minimization method does work well. It
is based on solving a sequence of unconstrained (or linearly constrained) minimization
problems, using the last point found as the starting point for the next
unconstrained minimization problem. In other words, we compute $x^\star(t)$ for a sequence
of increasing values of $t$, until $t \geq m/\epsilon$, which guarantees that we have an
$\epsilon$-suboptimal solution of the original problem. When the method was first proposed
by Fiacco and McCormick in the 1960s, it was called the sequential unconstrained
minimization technique (SUMT). Today the method is usually called the barrier
method or path-following method. A simple version of the method is as follows.


Algorithm  11.1  Barrier  method. 

given strictly feasible x, t := t^(0) > 0, μ > 1, tolerance ε > 0.

repeat

1. Centering step.
   Compute x*(t) by minimizing t f_0 + φ, subject to Ax = b, starting at x.

2. Update. x := x*(t).

3. Stopping criterion. quit if m/t < ε.

4. Increase t. t := μt.


At each iteration (except the first one) we compute the central point x*(t) starting
from the previously computed central point, and then increase t by a factor μ > 1.
The algorithm can also return λ = λ*(t), and ν = ν*(t), a dual ε-suboptimal point,
or certificate for x.

We refer to each execution of step 1 as a centering step (since a central point
is being computed) or an outer iteration, and to the first centering step (the
computation of x*(t^(0))) as the initial centering step. (Thus the simple algorithm with
t^(0) = m/ε consists of only the initial centering step.) Although any method for
linearly constrained minimization can be used in step 1, we will assume that Newton’s
method is used. We refer to the Newton iterations or steps executed during
the centering step as inner iterations. At each inner step, we have a primal feasible
point; we have a dual feasible point, however, only at the end of each outer
(centering) step.
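
As an illustration of Algorithm 11.1, the following is a minimal sketch for the special case of an LP in inequality form (minimize c^T x subject to Ax ⪯ b, with no equality constraints), assuming NumPy is available. The function names `centering_step` and `barrier_lp`, the default parameter values, and the small box-constrained test problem are assumptions made for this example and do not come from the text.

```python
import numpy as np

def centering_step(c, A, b, x, t, alpha=0.01, beta=0.5, tol=1e-5, max_iter=100):
    """Minimize t*c@x - sum(log(b - A@x)) by Newton's method with backtracking.

    Returns the approximate central point and the number of Newton steps taken.
    """
    steps = 0
    for _ in range(max_iter):
        s = b - A @ x                        # slacks; positive for strictly feasible x
        grad = t * c + A.T @ (1.0 / s)
        hess = A.T @ (A / s[:, None] ** 2)
        dx = np.linalg.solve(hess, -grad)    # Newton step
        lam2 = -grad @ dx                    # squared Newton decrement
        if lam2 / 2 <= tol:
            break
        f = t * (c @ x) - np.log(s).sum()
        step = 1.0
        while True:                          # backtracking line search
            x_new = x + step * dx
            s_new = b - A @ x_new
            if np.all(s_new > 0):
                f_new = t * (c @ x_new) - np.log(s_new).sum()
                if f_new <= f - alpha * step * lam2:
                    break
            step *= beta
        x = x_new
        steps += 1
    return x, steps

def barrier_lp(c, A, b, x, t0=1.0, mu=20.0, eps=1e-6):
    """Barrier method for: minimize c@x subject to A@x <= b, from strictly feasible x."""
    m = A.shape[0]
    t = t0
    total_newton_steps = 0
    while True:
        x, steps = centering_step(c, A, b, x, t)   # steps 1 and 2: center and update
        total_newton_steps += steps
        if m / t < eps:                            # step 3: stopping criterion
            return x, total_newton_steps
        t *= mu                                    # step 4: increase t

# Hypothetical test problem: minimize x1 + x2 over the box -1 <= x <= 1.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.ones(4)
c = np.array([1.0, 1.0])
x_opt, newton_steps = barrier_lp(c, A, b, x=np.zeros(2))
print(x_opt, c @ x_opt, newton_steps)
```

For this test problem the optimal value is -2, attained at x = (-1, -1), so the returned point should have objective value within the requested tolerance of -2. Handling equality constraints would require replacing the linear solve by the KKT system of §11.3.4.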

Accuracy of centering

We should make some comments on the accuracy to which we solve the centering
problems. Computing x*(t) exactly is not necessary since the central path has no
significance beyond the fact that it leads to a solution of the original problem as
t → ∞; inexact centering will still yield a sequence of points that converges to
an optimal point. Inexact centering, however, means that the points λ*(t), ν*(t),
computed from (11.10), are not exactly dual feasible. This can be corrected by
adding a correction term to the formula (11.10), which yields a dual feasible point
provided the computed x is near the central path, i.e., x*(t) (see exercise 11.9).

On the other hand, the cost of computing an extremely accurate minimizer of
t f_0 + φ, as compared to the cost of computing a good minimizer of t f_0 + φ, is
only marginally more, i.e., a few Newton steps at most. For this reason it is not
unreasonable to assume exact centering.

Choice of μ

The choice of the parameter μ involves a trade-off in the number of inner and outer
iterations required. If μ is small (i.e., near 1) then at each outer iteration t increases
by a small factor. As a result the initial point for the Newton process, i.e., the
previous iterate x, is a very good starting point, and the number of Newton steps
needed to compute the next iterate is small. Thus for small μ we expect a small
number of Newton steps per outer iteration, but of course a large number of outer
iterations since each outer iteration reduces the gap by only a small amount. In
this case the iterates (and indeed, the iterates of the inner iterations as well) closely
follow the central path. This explains the alternate name path-following method.

On the other hand if μ is large we have the opposite situation. After each
outer iteration t increases a large amount, so the current iterate is probably not
a very good approximation of the next iterate. Thus we expect many more inner
iterations. This ‘aggressive’ updating of t results in fewer outer iterations, since the
duality gap is reduced by the large factor μ at each outer iteration, but more inner
iterations. With μ large, the iterates are widely separated on the central path; the
inner iterates veer way off the central path.

This trade-off in the choice of μ is confirmed both in practice and, as we will
see, in theory. In practice, small values of μ (i.e., near one) result in many outer
iterations, with just a few Newton steps for each outer iteration. For μ in a fairly
large range, from around 3 to 100 or so, the two effects nearly cancel, so the total
number of Newton steps remains approximately constant. This means that the
choice of μ is not particularly critical; values from around 10 to 20 or so seem to
work well. When the parameter μ is chosen to give the best worst-case bound on
the total number of Newton steps required, values of μ near one are used.

Choice of t^(0)

Another important issue is the choice of initial value of t. Here the trade-off is
simple: If t^(0) is chosen too large, the first outer iteration will require too many
iterations. If t^(0) is chosen too small, the algorithm will require extra outer iterations,
and possibly too many inner iterations in the first centering step.

Since m/t^(0) is the duality gap that will result from the first centering step, one
reasonable choice is to choose t^(0) so that m/t^(0) is approximately of the same order
as f_0(x^(0)) - p*, or μ times this amount. For example, if a dual feasible point λ,
ν is known, with duality gap η = f_0(x^(0)) - g(λ, ν), then we can take t^(0) = m/η.
Thus, in the first outer iteration we simply compute a pair with the same duality
gap as the initial primal and dual feasible points.

Another possibility is suggested by the central path condition (11.7). We can
interpret

\[
\inf_\nu \left\| t \nabla f_0(x^{(0)}) + \nabla \phi(x^{(0)}) + A^T \nu \right\|_2
\qquad (11.12)
\]

as a measure for the deviation of x^(0) from the point x*(t), and choose for t^(0) the
value that minimizes (11.12). (This value of t and ν can be found by solving a
least-squares problem.)
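
The least-squares problem mentioned above can be sketched as follows, assuming NumPy. The unknowns (t, ν) are stacked into one vector, so that minimizing (11.12) over both becomes an ordinary least-squares fit. The function name `initial_t_least_squares` and the small test instance (a problem with constraints x ⪰ 0 and 1^T x = 1, and a linear objective) are hypothetical and only serve the illustration. The sketch does not enforce t > 0, which a caller would need to check.

```python
import numpy as np

def initial_t_least_squares(grad_f0, grad_phi, A):
    """Return (t, nu) minimizing ||t*grad_f0 + grad_phi + A.T @ nu||_2."""
    M = np.column_stack([grad_f0, A.T])            # columns multiply t and nu
    sol, *_ = np.linalg.lstsq(M, -grad_phi, rcond=None)
    return sol[0], sol[1:]

# Hypothetical instance: f_0(x) = c@x, barrier phi(x) = -sum(log(x)), constraint sum(x) = 1.
c = np.array([1.0, 2.0, 3.0])
A = np.ones((1, 3))
x0 = np.array([0.5, 0.3, 0.2])
grad_phi = -1.0 / x0                               # gradient of -sum(log(x)) at x0
t0, nu0 = initial_t_least_squares(c, grad_phi, A)
print(t0, nu0)
```

Solving for t and ν jointly is a design choice that keeps the sketch short; the residual norm at the solution indicates how far x^(0) is from the central path.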

A variation on this approach uses an affine-invariant measure of deviation between
x and x*(t) in place of the Euclidean norm. We choose t and ν that minimize

\[
\alpha(t, \nu) = \left( t \nabla f_0(x^{(0)}) + \nabla \phi(x^{(0)}) + A^T \nu \right)^T
H_0^{-1} \left( t \nabla f_0(x^{(0)}) + \nabla \phi(x^{(0)}) + A^T \nu \right),
\]

where

\[
H_0 = t \nabla^2 f_0(x^{(0)}) + \nabla^2 \phi(x^{(0)}).
\]

(It can be shown that inf_ν α(t, ν) is the square of the Newton decrement of t f_0 + φ
at x^(0).) Since α is a quadratic-over-linear function of ν and t, it is convex.


Infeasible  start  Newton  method 

In one variation on the barrier method, an infeasible start Newton method (described
in §10.3) is used for the centering steps. Thus, the barrier method is initialized
with a point x^(0) that satisfies x^(0) ∈ dom f_0 and f_i(x^(0)) < 0, i = 1, ..., m,
but not necessarily Ax^(0) = b. Assuming the problem is strictly feasible, a full Newton
step is taken at some point during the first centering step, and thereafter, the
iterates are all primal feasible, and the algorithm coincides with the (standard)
barrier method.


11.3.2  Examples 

Linear  programming  in  inequality  form 

Our  first  example  is  a  small  LP  in  inequality  form, 

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Ax \preceq b
\end{array}
\]

with A ∈ R^(100×50). The data were generated randomly, in such a way that the
problem is strictly primal and dual feasible, with optimal value p* = 1.

The initial point x^(0) is on the central path, with a duality gap of 100. The
barrier method is used to solve the problem, and terminated when the duality gap
is less than 10^-6. The centering problems are solved by Newton’s method with
backtracking, using parameters α = 0.01, β = 0.5. The stopping criterion for
Newton’s method is λ(x)^2/2 ≤ 10^-5, where λ(x) is the Newton decrement of the
function t c^T x + φ(x).

Figure 11.4 Progress of barrier method for a small LP, showing duality
gap versus cumulative number of Newton steps. Three plots are shown,
corresponding to three values of the parameter μ: 2, 50, and 150. In each
case, we have approximately linear convergence of duality gap.

The progress of the barrier method, for three values of the parameter μ, is
shown in figure 11.4. The vertical axis shows the duality gap on a log scale. The
horizontal axis shows the cumulative total number of inner iterations, i.e., Newton
steps, which is the natural measure of computational effort. Each of the plots has
a staircase shape, with each stair associated with one outer iteration. The width of
each stair tread (i.e., horizontal portion) is the number of Newton steps required
for that outer iteration. The height of each stair riser (i.e., the vertical portion) is
exactly equal to (a factor of) μ, since the duality gap is reduced by the factor μ at
the end of each outer iteration.

The plots illustrate several typical features of the barrier method. First of all,
the method works very well, with approximately linear convergence of the duality
gap. This is a consequence of the approximately constant number of Newton steps
required to re-center, for each value of μ. For μ = 50 and μ = 150, the barrier
method solves the problem with a total number of Newton steps between 35 and 40.

The plots in figure 11.4 clearly show the trade-off in the choice of μ. For μ = 2,
the treads are short; the number of Newton steps required to re-center is around 2
or 3. But the risers are also short, since the duality gap reduction per outer iteration
is only a factor of 2. At the other extreme, when μ = 150, the treads are longer,
typically around 7 Newton steps, but the risers are also much larger, since the
duality gap is reduced by the factor 150 in each outer iteration.

The trade-off in choice of μ is further examined in figure 11.5. We use the
barrier method to solve the LP, terminating when the duality gap is smaller than
10^-3, for 25 values of μ between 1.2 and 200. The plot shows the total number
of Newton steps required to solve the problem, as a function of the parameter μ.




Figure 11.5 Trade-off in the choice of the parameter μ, for a small LP. The
vertical axis shows the total number of Newton steps required to reduce the
duality gap from 100 to 10^-3, and the horizontal axis shows μ. The plot
shows the barrier method works well for values of μ larger than around 3,
but is otherwise not sensitive to the value of μ.


This plot shows that the barrier method performs very well for a wide range of
values of μ, from around 3 to 200. As our intuition suggests, the total number of
Newton steps rises when μ is too small, due to the larger number of outer iterations
required. One interesting observation is that the total number of Newton steps does
not vary much for values of μ larger than around 3. Thus, as μ increases over this
range, the decrease in the number of outer iterations is offset by an increase in
the number of Newton steps per outer iteration. For even larger values of μ, the
performance of the barrier method becomes less predictable (i.e., more dependent
on the particular problem instance). Since the performance does not improve with
larger values of μ, a good choice is in the range 10 - 100.

Geometric  programming 

We  consider  a  geometric  program  in  convex  form, 

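\[
\begin{array}{ll}
\mbox{minimize} & \log \left( \sum_{k=1}^{K_0} \exp(a_{0k}^T x + b_{0k}) \right) \\
\mbox{subject to} & \log \left( \sum_{k=1}^{K_i} \exp(a_{ik}^T x + b_{ik}) \right) \leq 0,
\quad i = 1, \ldots, m,
\end{array}
\]

with variable x ∈ R^n, and associated logarithmic barrier

\[
\phi(x) = - \sum_{i=1}^m \log \left( - \log \sum_{k=1}^{K_i} \exp(a_{ik}^T x + b_{ik}) \right).
\]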

The problem instance we consider has n = 50 variables and m = 100 inequalities
(like the small LP considered above). The objective and constraint functions all
have K_i = 5 terms. The problem instance was generated randomly, in such a way
that it is strictly primal and dual feasible, with optimal value one.

Figure 11.6 Progress of barrier method for a small GP, showing duality gap
versus cumulative number of Newton steps. Again we have approximately
linear convergence of duality gap.

We start with a point on the central path, with a duality gap of 100. The
barrier method is used to solve the problem, with parameters μ = 2, μ = 50, and
μ = 150, and terminated when the duality gap is less than 10^-6. The centering
problems are solved using Newton’s method, with the same parameter values as in
the LP example, i.e., α = 0.01, β = 0.5, and stopping criterion λ(x)^2/2 ≤ 10^-5.

Figure  11.6  shows  the  duality  gap  versus  cumulative  number  of  Newton  steps. 
This  plot  is  very  similar  to  the  plot  for  LP,  shown  in  figure  11.4.  In  particular, 
we  see  an  approximately  constant  number  of  Newton  steps  required  per  centering 
step,  and  therefore  approximately  linear  convergence  of  the  duality  gap. 

The variation of the total number of Newton steps required to solve the problem,
versus the parameter μ, is very similar to that in the LP example. For this GP,
the total number of Newton steps required to reduce the duality gap below 10^-3
is around 30 (ranging from around 20 to 40 or so) for values of μ between 10 and
200. So here, too, a good choice of μ is in the range 10 - 100.

A  family  of  standard  form  LPs 

In the examples above we examined the progress of the barrier method, in terms of
duality gap versus cumulative number of Newton steps, for a randomly generated
instance of an LP and a GP, with similar dimensions. The results for the two
examples are remarkably similar; each shows approximately linear convergence of
duality gap with the number of Newton steps. We also examined the variation in
performance with the parameter μ, and found essentially the same results in the
two cases. For μ above around 10, the barrier method performs very well, requiring
around 30 Newton steps to bring the duality gap down from 10^2 to 10^-6. In both
cases, the choice of μ hardly affects the total number of Newton steps required
(provided μ is larger than 10 or so).

In  this  section  we  examine  the  performance  of  the  barrier  method  as  a  function 
of  the  problem  dimensions.  We  consider  LPs  in  standard  form, 

\[
\begin{array}{ll}
\mbox{minimize} & c^T x \\
\mbox{subject to} & Ax = b, \quad x \succeq 0
\end{array}
\]

with A ∈ R^(m×n), and explore the total number of Newton steps required as a
function of the number of variables n and number of equality constraints m, for a
family of randomly generated problem instances. We take n = 2m, i.e., twice as
many variables as constraints.

The problems were generated as follows. The elements of A are independent and
identically distributed, with zero mean, unit variance normal distribution N(0, 1).
We take b = Ax^(0) where the elements of x^(0) are independent, and uniformly
distributed in [0, 1]. This ensures that the problem is strictly primal feasible, since
x^(0) ≻ 0 is feasible. To construct the cost vector c, we first compute a vector
z ∈ R^m with elements distributed according to N(0, 1) and a vector s ∈ R^n with
elements from a uniform distribution on [0, 1]. We then take c = A^T z + s. This
guarantees that the problem is strictly dual feasible, since A^T z ≺ c.
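
The generation procedure just described can be sketched in a few lines, assuming NumPy. The function name `random_standard_form_lp`, the seed, and the choice m = 50 are assumptions for this example. The final line checks weak duality for the constructed pair: the primal objective at x^(0) exceeds the dual objective b^T z by exactly s^T x^(0).

```python
import numpy as np

def random_standard_form_lp(m, rng):
    """Random LP: minimize c@x subject to A@x = b, x >= 0, with n = 2m variables."""
    n = 2 * m
    A = rng.standard_normal((m, n))         # i.i.d. N(0, 1) entries
    x0 = rng.uniform(0.0, 1.0, size=n)      # strictly feasible primal point
    b = A @ x0
    z = rng.standard_normal(m)              # strictly feasible dual point
    s = rng.uniform(0.0, 1.0, size=n)       # dual slack
    c = A.T @ z + s
    return A, b, c, x0, z

rng = np.random.default_rng(0)
A, b, c, x0, z = random_standard_form_lp(50, rng)
print(A.shape, c @ x0 - b @ z)              # the second value is the initial duality gap
```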

The algorithm parameters we use are μ = 100, and the same parameters for the
centering steps in the examples above: backtracking parameters α = 0.01, β = 0.5,
and stopping criterion λ(x)^2/2 ≤ 10^-5. The initial point is on the central path
with t^(0) = 1 (i.e., gap n). The algorithm is terminated when the initial duality
gap is reduced by a factor 10^4, i.e., after completing two outer iterations.

Figure 11.7 shows the duality gap versus iteration number for three problem
instances, with dimensions m = 50, m = 500, and m = 1000. The plots look very
much like the others, with approximately linear convergence of the duality gap.
The plots show a small increase in the number of Newton steps required as the
problem size grows from 50 constraints (100 variables) to 1000 constraints (2000
variables).

To examine the effect of problem size on the number of Newton steps required,
we generate 100 problem instances for each of 20 values of m, ranging from m = 10
to m = 1000. We solve each of these 2000 problems using the barrier method,
noting the number of Newton steps required. The results are summarized in figure
11.8, which shows the mean and standard deviation in the number of Newton
steps, for each value of m. The first comment we make is that the standard deviation
is around 2 iterations, and appears to be approximately independent of
problem size. Since the average number of steps required is near 25, this means
that the number of Newton steps required varies only around ±10%.

The  plot  shows  that  the  number  of  Newton  steps  required  grows  only  slightly, 
from  around  21  to  around  27,  as  the  problem  dimensions  increase  by  a  factor  of 
100.  This  behavior  is  typical  for  the  barrier  method  in  general:  The  number  of 
Newton  steps  required  grows  very  slowly  with  problem  dimensions,  and  is  almost 
always  around  a  few  tens.  Of  course,  the  computational  effort  to  carry  out  one 
Newton  step  grows  with  the  problem  dimensions. 

Figure 11.7 Progress of barrier method for three randomly generated standard
form LPs of different dimensions, showing duality gap versus cumulative
number of Newton steps. The number of variables in each problem is
n = 2m. Here too we see approximately linear convergence of the duality
gap, with a slight increase in the number of Newton steps required for the
larger problems.


Figure 11.8 Average number of Newton steps required to solve 100 randomly
generated LPs of different dimensions, with n = 2m. Error bars show standard
deviation, around the average value, for each value of m. The growth
in the number of Newton steps required, as the problem dimensions range
over a 100:1 ratio, is very small.


11.3.3 Convergence analysis


Convergence analysis for the barrier method is straightforward. Assuming that
t f_0 + φ can be minimized by Newton’s method for t = t^(0), μt^(0), μ^2 t^(0), ..., the
duality gap after the initial centering step, and k additional centering steps, is
m/(μ^k t^(0)). Therefore the desired accuracy ε is achieved after exactly

\[
\left\lceil \frac{\log(m/(\epsilon t^{(0)}))}{\log \mu} \right\rceil
\qquad (11.13)
\]

centering steps, plus the initial centering step.
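
As a small worked illustration of (11.13), the sketch below evaluates the count for the setting of the LP example in §11.3.2 (initial duality gap m/t^(0) = 100, tolerance 10^-6) and the three values of μ used there. The function name `centering_steps_needed` is an assumption for this example, and the clamp to zero is an addition covering the case where the initial gap already meets the tolerance.

```python
import math

def centering_steps_needed(m, t0, mu, eps):
    """Centering steps after the initial one, following (11.13)."""
    k = math.ceil(math.log(m / (eps * t0)) / math.log(mu))
    return max(0, k)   # added guard: no extra steps if m/t0 is already below eps

for mu in (2, 50, 150):
    print(mu, centering_steps_needed(m=100, t0=1.0, mu=mu, eps=1e-6))
```

For these values the formula gives 27, 5, and 4 additional centering steps respectively, which matches the number of risers one would expect in each staircase plot.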

It follows that the barrier method works provided the centering problem (11.6)
is solvable by Newton’s method, for t ≥ t^(0). For the standard Newton method, it
suffices that for t ≥ t^(0), the function t f_0 + φ satisfies the conditions given in §10.2.4,
page 529: its initial sublevel set is closed, the associated inverse KKT matrix is
bounded, and the Hessian satisfies a Lipschitz condition. (Another set of sufficient
conditions, based on self-concordance, will be discussed in detail in §11.5.) If the
infeasible start Newton method is used for centering, then the conditions listed
in §10.3.3, page 536, are sufficient to guarantee convergence.

Assuming that f_0, ..., f_m are closed, a simple modification of the original
problem ensures that these conditions hold. By adding a constraint of the form
||x||_2^2 ≤ R^2 to the problem, it follows that t f_0 + φ is strongly convex, for every
t ≥ 0; in particular convergence of Newton’s method, for the centering steps, is
guaranteed. (See exercise 11.4.)

While this analysis shows that the barrier method does converge, under reasonable
assumptions, it does not address a basic question: As the parameter t increases,
do  the  centering  problems  become  more  difficult  (and  therefore  take  more  and  more 
iterations)?  Numerical  evidence  suggests  that  for  a  wide  variety  of  problems,  this 
is  not  the  case;  the  centering  problems  appear  to  require  a  nearly  constant  number 
of  Newton  steps  to  solve,  even  as  t  increases.  We  will  see  (in  §11.5)  that  this  issue 
can  be  resolved,  for  problems  that  satisfy  certain  self-concordance  conditions. 


11.3.4  Newton  step  for  modified  KKT  equations 


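In the barrier method, the Newton step Δx_nt, and associated dual variable ν_nt, are
given by the linear equations

\[
\left[ \begin{array}{cc}
t \nabla^2 f_0(x) + \nabla^2 \phi(x) & A^T \\
A & 0
\end{array} \right]
\left[ \begin{array}{c} \Delta x_{\rm nt} \\ \nu_{\rm nt} \end{array} \right]
= - \left[ \begin{array}{c} t \nabla f_0(x) + \nabla \phi(x) \\ 0 \end{array} \right].
\qquad (11.14)
\]

In this section we show how these Newton steps for the centering problem can be
interpreted as Newton steps for directly solving the modified KKT equations

\[
\begin{array}{rcl}
\nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + A^T \nu & = & 0 \\
-\lambda_i f_i(x) & = & 1/t, \quad i = 1, \ldots, m \\
Ax & = & b
\end{array}
\qquad (11.15)
\]

in a particular way.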

To solve the modified KKT equations (11.15), which is a set of n + p + m
nonlinear equations in the n + p + m variables x, ν, and λ, we first eliminate the
variables λ_i, using λ_i = -1/(t f_i(x)). This yields

\[
\nabla f_0(x) + \sum_{i=1}^m \frac{1}{-t f_i(x)} \nabla f_i(x) + A^T \nu = 0,
\qquad Ax = b,
\qquad (11.16)
\]

which is a set of n + p equations in the n + p variables x and ν.

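To find the Newton step for solving the set of nonlinear equations (11.16),
we form the Taylor approximation for the nonlinear term occurring in the first
equation. For v small, we have the Taylor approximation

\[
\begin{array}{l}
\nabla f_0(x+v) + \sum_{i=1}^m \frac{1}{-t f_i(x+v)} \nabla f_i(x+v) \\
\quad \approx \nabla f_0(x) + \sum_{i=1}^m \frac{1}{-t f_i(x)} \nabla f_i(x)
+ \nabla^2 f_0(x) v \\
\qquad {} + \sum_{i=1}^m \frac{1}{-t f_i(x)} \nabla^2 f_i(x) v
+ \sum_{i=1}^m \frac{1}{t f_i(x)^2} \nabla f_i(x) \nabla f_i(x)^T v.
\end{array}
\]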

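The Newton step is obtained by replacing the nonlinear term in equation (11.16)
by this Taylor approximation, which yields the linear equations

\[
Hv + A^T \nu = -g, \qquad Av = 0, \qquad (11.17)
\]

where

\[
\begin{array}{rcl}
H & = & \nabla^2 f_0(x) + \sum_{i=1}^m \frac{1}{-t f_i(x)} \nabla^2 f_i(x)
+ \sum_{i=1}^m \frac{1}{t f_i(x)^2} \nabla f_i(x) \nabla f_i(x)^T \\
g & = & \nabla f_0(x) + \sum_{i=1}^m \frac{1}{-t f_i(x)} \nabla f_i(x).
\end{array}
\]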


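Now we observe that

\[
H = \nabla^2 f_0(x) + (1/t) \nabla^2 \phi(x), \qquad
g = \nabla f_0(x) + (1/t) \nabla \phi(x),
\]

so, from (11.14), the Newton steps Δx_nt and ν_nt in the barrier method centering
step satisfy

\[
t H \Delta x_{\rm nt} + A^T \nu_{\rm nt} = -t g, \qquad A \Delta x_{\rm nt} = 0.
\]

Comparing this with (11.17) shows that

\[
v = \Delta x_{\rm nt}, \qquad \nu = (1/t) \nu_{\rm nt}.
\]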

This shows that the Newton step for the centering problem (11.6) can be interpreted,
after scaling the dual variable, as the Newton step for solving the modified
KKT equations (11.16).

In this approach, we first eliminated the variable λ from the modified KKT
equations, and then applied Newton’s method to solve the resulting set of equations.
Another variation on this approach is to directly apply Newton’s method to the
modified KKT equations, without first eliminating λ. This method yields the so-called
primal-dual search directions, discussed in §11.7.
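
The relation between the two Newton steps can be checked numerically. The sketch below, assuming NumPy, builds a random instance with a convex quadratic objective and linear inequalities (so the second derivatives of the f_i vanish), solves the centering system (11.14) and the linearized modified KKT system (11.17), and compares the results. All data, dimensions, and variable names are hypothetical choices made for this check.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, t = 6, 10, 2, 5.0

B = rng.standard_normal((n, n))
P = B.T @ B                               # f_0(x) = 0.5*x@P@x + c@x
c = rng.standard_normal(n)
G = rng.standard_normal((m, n))           # f_i(x) = G[i] @ x - h[i]
A = rng.standard_normal((p, n))           # equality constraint matrix
x = rng.standard_normal(n)
h = G @ x + rng.uniform(0.5, 1.5, m)      # chosen so that every f_i(x) < 0
fi = G @ x - h

grad_f0, hess_f0 = P @ x + c, P
grad_phi = G.T @ (-1.0 / fi)              # gradient of phi = -sum(log(-f_i))
hess_phi = G.T @ (G / fi[:, None] ** 2)   # Hessian of phi for linear f_i
Z = np.zeros((p, p))

# Newton step for the centering problem, equation (11.14).
K = np.block([[t * hess_f0 + hess_phi, A.T], [A, Z]])
sol = np.linalg.solve(K, -np.concatenate([t * grad_f0 + grad_phi, np.zeros(p)]))
dx_nt, nu_nt = sol[:n], sol[n:]

# Newton step for the modified KKT equations, equation (11.17).
H = hess_f0 + G.T @ (G / (t * fi[:, None] ** 2))
g = grad_f0 + G.T @ (-1.0 / (t * fi))
K = np.block([[H, A.T], [A, Z]])
sol = np.linalg.solve(K, -np.concatenate([g, np.zeros(p)]))
v, nu = sol[:n], sol[n:]

print(np.allclose(v, dx_nt), np.allclose(nu, nu_nt / t))
```

Both comparisons should hold up to rounding error, since the second system is the first one divided by t with the dual variable rescaled.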


11.4 Feasibility and phase I methods

The barrier method requires a strictly feasible starting point x^(0). When such a
point is not known, the barrier method is preceded by a preliminary stage, called
phase I, in which a strictly feasible point is computed (or the constraints are found
to be infeasible). The strictly feasible point found during phase I is then used as
the starting point for the barrier method, which is called the phase II stage. In
this section we describe several phase I methods.


11.4.1  Basic  phase  I  method 

We consider a set of inequalities and equalities in the variables x ∈ R^n,

\[
f_i(x) \leq 0, \quad i = 1, \ldots, m, \qquad Ax = b, \qquad (11.18)
\]

where f_i : R^n → R are convex, with continuous second derivatives. We assume
that we are given a point x^(0) ∈ dom f_1 ∩ ... ∩ dom f_m, with Ax^(0) = b.

Our  goal  is  to  find  a  strictly  feasible  solution  of  these  inequalities  and  equalities, 
or  determine  that  none  exists.  To  do  this  we  form  the  following  optimization 
problem: 

\[
\begin{array}{ll}
\mbox{minimize} & s \\
\mbox{subject to} & f_i(x) \leq s, \quad i = 1, \ldots, m \\
& Ax = b
\end{array}
\qquad (11.19)
\]

in the variables x ∈ R^n, s ∈ R. The variable s can be interpreted as a bound on
the maximum infeasibility of the inequalities; the goal is to drive the maximum
infeasibility below zero.

This problem is always strictly feasible, since we can choose x^(0) as starting
point for x, and for s, we can choose any number larger than max_{i=1,...,m} f_i(x^(0)).
We can therefore apply the barrier method to solve the problem (11.19), which is
called the phase I optimization problem associated with the inequality and equality
system (11.18).

We  can  distinguish  three  cases  depending  on  the  sign  of  the  optimal  value  p* 
of  (11.19). 

1.  If  p*  <  0,  then  (11.18)  has  a  strictly  feasible  solution.  Moreover  if  (x,s)  is 
feasible  for  (11.19)  with  s  <  0,  then  x  satisfies  fi(x)  <  0.  This  means  we  do 
not  need  to  solve  the  optimization  problem  (11.19)  with  high  accuracy;  we 
can  terminate  when  s  <  0. 

2. If p* > 0, then (11.18) is infeasible. As in case 1, we do not need to solve
the phase I optimization problem (11.19) to high accuracy; we can terminate
when a dual feasible point is found with positive dual objective (which proves
that p* > 0). In this case, we can construct the alternative that proves (11.18)
is infeasible from the dual feasible point.

3. If p* = 0 and the minimum is attained at x* and s* = 0, then the set of
inequalities is feasible, but not strictly feasible. If p* = 0 and the minimum
is not attained, then the inequalities are infeasible.

In practice it is impossible to determine exactly that p* = 0. Instead, an
optimization algorithm applied to (11.19) will terminate with the conclusion
that |p*| < ε for some small, positive ε. This allows us to conclude that the
inequalities f_i(x) ≤ -ε are infeasible, while the inequalities f_i(x) ≤ ε are
feasible.
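
The basic phase I method, with the early termination of case 1, can be sketched for a system of linear inequalities Ax ⪯ b without equality constraints, assuming NumPy. The function name `phase1_linear`, its default parameters, and the triangle-shaped test system are assumptions for this example. The sketch does not construct the dual certificate of case 2; when it returns with s > 0 it only reports the approximate optimal value of the phase I problem. It also assumes the centering systems are nonsingular, which holds for the test system.

```python
import numpy as np

def phase1_linear(A, b, x0, t=1.0, mu=20.0, eps=1e-6, alpha=0.01, beta=0.5):
    """Basic phase I for A@x <= b: minimize s subject to A@x - s <= b.

    Returns (x, s); x is strictly feasible for A@x < b whenever s < 0.
    """
    m, n = A.shape
    At = np.hstack([A, -np.ones((m, 1))])            # constraints in z = (x, s)
    z = np.append(x0, np.max(A @ x0 - b) + 1.0)      # strictly feasible start
    while True:
        for _ in range(50):                          # centering by Newton's method
            r = b - At @ z
            grad = At.T @ (1.0 / r)
            grad[-1] += t                            # objective is t*s
            hess = At.T @ (At / r[:, None] ** 2)
            dz = np.linalg.solve(hess, -grad)
            lam2 = -grad @ dz
            if lam2 / 2 <= 1e-5:
                break
            f = t * z[-1] - np.log(r).sum()
            step = 1.0
            while True:                              # backtracking line search
                z_new = z + step * dz
                r_new = b - At @ z_new
                if np.all(r_new > 0):
                    f_new = t * z_new[-1] - np.log(r_new).sum()
                    if f_new <= f - alpha * step * lam2:
                        break
                step *= beta
            z = z_new
            if z[-1] < 0:                            # case 1: stop as soon as s < 0
                return z[:n], z[-1]
        if z[-1] < 0 or m / t < eps:
            return z[:n], z[-1]
        t *= mu

# Hypothetical system: x1 >= 1, x2 >= 1, x1 + x2 <= 3, started from the infeasible origin.
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([-1.0, -1.0, 3.0])
x_feas, s_val = phase1_linear(A, b, x0=np.zeros(2))
print(x_feas, s_val)
```

Because every iterate keeps b - Ax + s positive, any iterate with s < 0 satisfies Ax ≺ b, which is why the loop can stop well before the phase I problem is solved accurately.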

Sum  of  infeasibilities 

There  are  many  variations  on  the  basic  phase  I  method  just  described.  One  method 
is  based  on  minimizing  the  sum  of  the  infeasibilities,  instead  of  the  maximum 
infeasibility.  We  form  the  problem 

\[
\begin{array}{ll}
\mbox{minimize} & \mathbf{1}^T s \\
\mbox{subject to} & f_i(x) \leq s_i, \quad i = 1, \ldots, m \\
& Ax = b \\
& s \succeq 0.
\end{array}
\qquad (11.20)
\]

For fixed x, the optimal value of s_i is max{f_i(x), 0}, so in this problem we are
minimizing the sum of the infeasibilities. The optimal value of (11.20) is zero and
achieved if and only if the original set of equalities and inequalities is feasible.
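
To make the two infeasibility measures concrete, the sketch below evaluates, for a fixed x and linear inequalities Ax ⪯ b, the maximum infeasibility and the sum of infeasibilities, using the fact that the optimal s_i is max{f_i(x), 0}. It assumes NumPy; the function name `infeasibility_summary` and the one-variable infeasible system (x ≤ 1, x ≥ 2, x ≤ 3) are hypothetical.

```python
import numpy as np

def infeasibility_summary(A, b, x):
    """Max and sum of infeasibilities of A@x <= b at x, and the number violated."""
    viol = np.maximum(A @ x - b, 0.0)     # optimal s_i in (11.20) for this x
    return {"max": float(viol.max()),
            "sum": float(viol.sum()),
            "violated": int((viol > 0).sum())}

A = np.array([[1.0], [-1.0], [1.0]])      # x <= 1, -x <= -2, x <= 3
b = np.array([1.0, -2.0, 3.0])
at_mid = infeasibility_summary(A, b, np.array([1.5]))
at_two = infeasibility_summary(A, b, np.array([2.0]))
print(at_mid, at_two)
```

In this toy system x = 1.5 minimizes the maximum infeasibility but violates two inequalities, while x = 2 attains the same sum of infeasibilities and violates only one.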

This sum of infeasibilities phase I method has a very interesting property when
the system of equalities and inequalities (11.18) is infeasible. In this case, the optimal
point for the phase I problem (11.20) often violates only a small number,
say r, of the inequalities. Therefore, we have computed a point that satisfies many
(m - r) of the inequalities, i.e., we have identified a large subset of inequalities
that is feasible. In this case, the dual variables associated with the strictly satisfied
inequalities are zero, so we have also proved infeasibility of a subset of the inequalities.
This is more informative than finding that the m inequalities, together, are
mutually infeasible. (This phenomenon is closely related to ℓ1-norm regularization,
or basis pursuit, used to find sparse approximate solutions; see §6.1.2 and §6.5.4.)


Example 11.4 Comparison of phase I methods. We apply two phase I methods to
an infeasible set of inequalities Ax ⪯ b with dimensions m = 100, n = 50. The first
method is the basic phase I method

\[
\begin{array}{ll}
\mbox{minimize} & s \\
\mbox{subject to} & Ax \preceq b + \mathbf{1} s,
\end{array}
\]

which minimizes the maximum infeasibility. The second method minimizes the sum
of the infeasibilities, i.e., solves the LP

\[
\begin{array}{ll}
\mbox{minimize} & \mathbf{1}^T s \\
\mbox{subject to} & Ax \preceq b + s \\
& s \succeq 0.
\end{array}
\]

Figure 11.9 shows the distributions of the infeasibilities b_i - a_i^T x for these two values
of x, denoted x_max and x_sum, respectively. The point x_max satisfies 39 of the 100
inequalities, whereas the point x_sum satisfies 79 of the inequalities.

Figure 11.9 Distributions of the infeasibilities b_i - a_i^T x for an infeasible set
of 100 inequalities a_i^T x ≤ b_i, with 50 variables. The vector x_max used in
the left plot was obtained by the basic phase I algorithm. It satisfies 39
of the 100 inequalities. In the right plot the vector x_sum was obtained by
minimizing the sum of the infeasibilities. This vector satisfies 79 of the 100
inequalities.


Termination  near  the  phase  II  central  path 

A  simple  variation  on  the  basic  phase  I  method,  using  the  barrier  method,  has 
the  property  that  (when  the  equalities  and  inequalities  are  strictly  feasible)  the 
central  path  for  the  phase  I  problem  intersects  the  central  path  for  the  original 
optimization  problem  (11.1). 

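We assume a point $x^{(0)} \in \mathcal{D} = \operatorname{dom} f_0 \cap \operatorname{dom} f_1 \cap \cdots \cap \operatorname{dom} f_m$, with $Ax^{(0)} = b$,
is given. We form the phase I optimization problem

$$
\begin{array}{ll}
\mbox{minimize} & s \\
\mbox{subject to} & f_i(x) \leq s, \quad i = 1, \ldots, m \\
& f_0(x) \leq M \\
& Ax = b,
\end{array}
\qquad (11.21)
$$

where $M$ is a constant chosen to be larger than $\max\{f_0(x^{(0)}), p^\star\}$.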

We assume now that the original problem (11.1) is strictly feasible, so the
optimal value $\bar p^\star$ of (11.21) is negative. The central path of (11.21) is characterized
by

$$
\sum_{i=1}^m \frac{1}{s - f_i(x)} = \bar t, \qquad
\frac{1}{M - f_0(x)} \nabla f_0(x) + \sum_{i=1}^m \frac{1}{s - f_i(x)} \nabla f_i(x) + A^T \nu = 0,
$$

where $\bar t$ is the parameter. If $(x, s)$ is on the central path and $s = 0$, then $x$ and $\nu$
satisfy

$$
t \nabla f_0(x) + \sum_{i=1}^m \frac{1}{-f_i(x)} \nabla f_i(x) + A^T \nu = 0
$$

for $t = 1/(M - f_0(x))$. This means that $x$ is on the central path for the original
optimization problem (11.1), with associated duality gap

$$
m(M - f_0(x)) \leq m(M - p^\star). \qquad (11.22)
$$


11.4.2  Phase  I  via  infeasible  start  Newton  method 

We  can  also  carry  out  the  phase  I  stage  using  an  infeasible  start  Newton  method, 
applied  to  a  modified  version  of  the  original  problem 

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& Ax = b.
\end{array}
$$

We first express the problem in the (obviously equivalent) form

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \leq s, \quad i = 1, \ldots, m \\
& Ax = b, \quad s = 0,
\end{array}
$$

with the additional variable $s \in \mathbf{R}$. To start the barrier method, we use an infeasible
start Newton method to solve

$$
\begin{array}{ll}
\mbox{minimize} & t^{(0)} f_0(x) - \sum_{i=1}^m \log(s - f_i(x)) \\
\mbox{subject to} & Ax = b, \quad s = 0.
\end{array}
$$

This can be initialized with any $x \in \mathcal{D}$, and any $s > \max_i f_i(x)$. Provided the
problem is strictly feasible, the infeasible start Newton method will eventually
take an undamped step, and thereafter we will have $s = 0$, i.e., $x$ strictly feasible.

The same trick can be applied if a point in $\mathcal{D}$, the common domain of the
functions, is not known. We simply apply the infeasible start Newton method to
the problem

$$
\begin{array}{ll}
\mbox{minimize} & t^{(0)} f_0(x + z_0) - \sum_{i=1}^m \log(s - f_i(x + z_i)) \\
\mbox{subject to} & Ax = b, \quad s = 0, \quad z_0 = 0, \; \ldots, \; z_m = 0
\end{array}
$$

with variables $x$, $z_0, \ldots, z_m$, and $s \in \mathbf{R}$. We initialize $z_i$ so that $x + z_i \in \operatorname{dom} f_i$.

The  main  disadvantage  of  this  approach  to  the  phase  I  problem  is  that  there  is 
no  good  stopping  criterion  when  the  problem  is  infeasible;  the  residual  simply  fails 
to  converge  to  zero. 


11.4.3  Examples 

We consider a family of linear feasibility problems,

$$
Ax \preceq b(\gamma)
$$

where $A \in \mathbf{R}^{50 \times 20}$ and $b(\gamma) = b + \gamma \Delta b$. The problem data are chosen so that the
inequalities are strictly feasible for $\gamma > 0$ and infeasible for $\gamma < 0$. For $\gamma = 0$ the
problem is feasible but not strictly feasible.

Figure 11.10 shows the total number of Newton steps required to find a strictly
feasible point, or a certificate of infeasibility, for 40 values of $\gamma$ in $[-1, 1]$. We use
the basic phase I method of §11.4.1, i.e., for each value of $\gamma$, we form the LP

$$
\begin{array}{ll}
\mbox{minimize} & s \\
\mbox{subject to} & Ax \preceq b(\gamma) + s\mathbf{1}.
\end{array}
$$

The barrier method is used with $\mu = 10$, and starting point $x = 0$, $s = -\min_i b_i(\gamma) + 1$.
The method terminates when a point $(x, s)$ with $s < 0$ is found, or a feasible
solution $z$ of the dual problem

$$
\begin{array}{ll}
\mbox{maximize} & -b(\gamma)^T z \\
\mbox{subject to} & A^T z = 0 \\
& \mathbf{1}^T z = 1 \\
& z \succeq 0
\end{array}
$$

is found with $-b(\gamma)^T z > 0$.
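
A dual feasible point $z$ with $-b^T z > 0$ proves infeasibility because $Ax \preceq b$ and $z \succeq 0$ would give $0 = z^T A x \leq b^T z < 0$. Checking such a certificate is straightforward; the sketch below assumes dense list-of-lists data, and the function name `is_infeasibility_certificate` and the tolerance are our own illustrative choices.

```python
def is_infeasibility_certificate(A, b, z, tol=1e-9):
    """Check that z proves A x <= b has no solution.

    The conditions are z >= 0, 1^T z = 1, A^T z = 0, and -b^T z > 0.
    """
    m, n = len(A), len(A[0])
    if any(zi < -tol for zi in z):
        return False
    if abs(sum(z) - 1.0) > tol:
        return False
    for j in range(n):
        # j-th component of A^T z
        if abs(sum(A[i][j] * z[i] for i in range(m))) > tol:
            return False
    return -sum(bi * zi for bi, zi in zip(b, z)) > tol

# Hypothetical data: x <= -1 and -x <= -1 (i.e., x >= 1) cannot both hold.
A = [[1.0], [-1.0]]
infeasible_b = [-1.0, -1.0]
# The same rows with right-hand side 1 describe -1 <= x <= 1, which is feasible.
feasible_b = [1.0, 1.0]
z = [0.5, 0.5]
```

Here `z` certifies infeasibility for `infeasible_b` but not for `feasible_b`, where $-b^T z = -1$.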

The  plot  shows  that  when  the  inequalities  are  feasible,  with  some  margin,  it 
takes  around  25  Newton  steps  to  produce  a  strictly  feasible  point.  Conversely, 
when  the  inequalities  are  infeasible,  again  with  some  margin,  it  takes  around  35 
steps  to  produce  a  certificate  proving  infeasibility.  The  phase  I  effort  increases  as 
the  set  of  inequalities  approaches  the  boundary  between  feasible  and  infeasible, 
i.e., $\gamma$ near zero. When $\gamma$ is very near zero, so the inequalities are very near the
boundary between feasible and infeasible, the number of steps grows substantially.
Figure 11.11 shows the total number of Newton steps required for values of $\gamma$
near  zero.  The  plots  show  an  approximately  logarithmic  increase  in  the  number 
of  steps  required  to  detect  feasibility,  or  prove  infeasibility,  for  problems  very  near 
the  boundary  between  feasible  and  infeasible. 

This  example  is  typical:  The  cost  of  solving  a  set  of  convex  inequalities  and 
linear  equalities  using  the  barrier  method  is  modest,  and  approximately  constant, 
as  long  as  the  problem  is  not  very  close  to  the  boundary  between  feasibility  and 
infeasibility.  When  the  problem  is  very  close  to  the  boundary,  the  number  of 
Newton  steps  required  to  find  a  strictly  feasible  point  or  produce  a  certificate 
of  infeasibility  grows.  When  the  problem  is  exactly  on  the  boundary  between 
strictly  feasible  and  infeasible,  for  example,  feasible  but  not  strictly  feasible,  the 
cost  becomes  infinite. 

Feasibility  using  infeasible  start  Newton  method 

We  also  solve  the  same  set  of  feasibility  problems  using  the  infeasible  start  Newton 
method,  applied  to  the  problem 

$$
\begin{array}{ll}
\mbox{minimize} & -\sum_{i=1}^m \log s_i \\
\mbox{subject to} & Ax + s = b(\gamma).
\end{array}
$$

We use backtracking parameters $\alpha = 0.01$, $\beta = 0.9$, and initialize with $x^{(0)} = 0$,
$s^{(0)} = \mathbf{1}$, $\nu^{(0)} = 0$. We consider only feasible problems (i.e., $\gamma > 0$) and terminate
once a feasible point is found. (We do not consider infeasible problems, since in
that case the residual simply converges to a positive number.) Figure 11.12 shows
the number of Newton steps required to find a feasible point, as a function of $\gamma$.

Figure 11.10 Number of Newton iterations required to detect feasibility or
infeasibility of a set of linear inequalities $Ax \preceq b + \gamma \Delta b$ parametrized by
$\gamma \in \mathbf{R}$. The inequalities are strictly feasible for $\gamma > 0$, and infeasible for
$\gamma < 0$. For $\gamma$ larger than around 0.2, about 30 steps are required to compute
a strictly feasible point; for $\gamma$ less than $-0.5$ or so, it takes around 35 steps
to produce a certificate proving infeasibility. For values of $\gamma$ in between, and
especially near zero, more Newton steps are required to determine feasibility.

Figure 11.11 Left. Number of Newton iterations required to find a proof of
infeasibility versus $\gamma$, for $\gamma$ small and negative. Right. Number of Newton
iterations required to find a strictly feasible point versus $\gamma$, for $\gamma$ small and
positive.

Figure 11.12 Number of iterations required to find a feasible point for a set
of linear inequalities $Ax \preceq b + \gamma \Delta b$ parametrized by $\gamma \in \mathbf{R}$. The infeasible
start Newton method is used, and terminated when a feasible point is found.
For $\gamma = 10$, the starting point $x^{(0)} = 0$ happened to be feasible (0 iterations).


The plot shows that for $\gamma$ larger than 0.3 or so, it takes fewer than 20 Newton
steps to find a feasible point. In these cases the method is more efficient than a
phase I method, which takes a total of around 30 Newton steps. For smaller values
of $\gamma$, the number of Newton steps required grows dramatically, approximately as
$1/\gamma$. For $\gamma = 0.01$, the infeasible start Newton method requires several thousand
iterations to produce a feasible point. In this region the phase I approach is far
more efficient, requiring only 40 iterations or so.

These  results  are  quite  typical.  The  infeasible  start  Newton  method  works 
very  well  provided  the  inequalities  are  feasible,  and  not  very  close  to  the  boundary 
between  feasible  and  infeasible.  But  when  the  feasible  set  is  just  barely  nonempty 
(as is the case in this example with small $\gamma$), a phase I method is far better. Another
advantage  of  the  phase  I  method  is  that  it  gracefully  handles  the  infeasible  case; 
the  infeasible  start  Newton  method,  in  contrast,  simply  fails  to  converge. 


11.5  Complexity  analysis  via  self-concordance 

Using  the  complexity  analysis  of  Newton’s  method  for  self-concordant  functions 
(§9.6.4,  page  503,  and  §10.2.4,  page  531),  we  can  give  a  complexity  analysis  of 
the  barrier  method.  The  analysis  applies  to  many  common  problems,  and  leads 
to  several  interesting  conclusions:  It  gives  a  rigorous  bound  on  the  total  number 
of  Newton  steps  required  to  solve  a  problem  using  the  barrier  method,  and  it 
justifies  our  observation  that  the  centering  problems  do  not  become  more  difficult 
as  t  increases. 
11.5.1 Self-concordance assumption


We  make  two  assumptions. 

• The function $t f_0 + \phi$ is closed and self-concordant for all $t \geq t^{(0)}$.

•  The  sublevel  sets  of  (11.1)  are  bounded. 

The  second  assumption  implies  that  the  centering  problem  has  bounded  sublevel 
sets  (see  exercise  11.3),  and,  therefore,  the  centering  problem  is  solvable.  The 
bounded sublevel set assumption also implies that the Hessian of $t f_0 + \phi$ is positive
definite  everywhere  (see  exercise  11.14).  While  the  self-concordance  assumption 
restricts  the  complexity  analysis  to  a  particular  class  of  problems,  it  is  important 
to  emphasize  that  the  barrier  method  works  well  in  general,  whether  or  not  the 
self-concordance  assumption  holds. 

The  self-concordance  assumption  holds  for  a  variety  of  problems,  including  all 
linear and quadratic problems. If the functions $f_i$ are linear or quadratic, then

$$
t f_0 - \sum_{i=1}^m \log(-f_i)
$$

is self-concordant for all values of $t \geq 0$ (see §9.6). The complexity analysis given
below therefore applies to LPs, QPs, and QCQPs.

In  other  cases,  it  is  possible  to  reformulate  the  problem  so  the  assumption  of 
self-concordance  holds.  As  an  example,  consider  the  linear  inequality  constrained 
entropy  maximization  problem 

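$$
\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^n x_i \log x_i \\
\mbox{subject to} & Fx \preceq g \\
& Ax = b.
\end{array}
$$

The function

$$
t f_0(x) + \phi(x) = t \sum_{i=1}^n x_i \log x_i - \sum_{i=1}^m \log(g_i - f_i^T x),
$$

where $f_1^T, \ldots, f_m^T$ are the rows of $F$, is not closed (unless $Fx \preceq g$ implies $x \succ 0$), or
self-concordant. We can, however, add the redundant inequality constraints $x \succeq 0$
to obtain the equivalent problem

$$
\begin{array}{ll}
\mbox{minimize} & \sum_{i=1}^n x_i \log x_i \\
\mbox{subject to} & Fx \preceq g \\
& Ax = b \\
& x \succeq 0.
\end{array}
\qquad (11.23)
$$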


For this problem we have

$$
t f_0(x) + \phi(x) = t \sum_{i=1}^n x_i \log x_i - \sum_{i=1}^n \log x_i - \sum_{i=1}^m \log(g_i - f_i^T x),
$$

which is self-concordant and closed, for any $t \geq 0$. (The function $t y \log y - \log y$
is self-concordant on $\mathbf{R}_{++}$, for all $t \geq 0$; see exercise 11.13.) The complexity
analysis therefore applies to the reformulated linear inequality constrained entropy
maximization problem (11.23).
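
The self-concordance of $h(y) = t y \log y - \log y$ can be checked numerically against the defining inequality $|h'''(y)| \leq 2 h''(y)^{3/2}$, using $h''(y) = t/y + 1/y^2$ and $h'''(y) = -t/y^2 - 2/y^3$. The sketch below is only a spot check on a hypothetical grid of values, not a proof; the function name is ours.

```python
def self_concordance_ratio(t, y):
    """Return |h'''(y)| / (2 h''(y)^{3/2}) for h(y) = t*y*log(y) - log(y), y > 0."""
    second = t / y + 1.0 / y ** 2          # h''(y)
    third = -t / y ** 2 - 2.0 / y ** 3     # h'''(y)
    return abs(third) / (2.0 * second ** 1.5)

# Hypothetical grid of barrier parameters t >= 0 and points y > 0.
ts = [0.0, 1e-3, 0.1, 1.0, 10.0, 1e3]
ys = [1e-3, 0.1, 1.0, 10.0, 1e3]
worst_ratio = max(self_concordance_ratio(t, y) for t in ts for y in ys)
```

A ratio of at most one at every grid point is consistent with self-concordance. With $u = ty$ the ratio equals $(u + 2)/(2(1 + u)^{3/2})$, so it reaches one only when $u = 0$.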

As  a  more  exotic  example,  consider  the  GP 

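$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) = \log \left( \sum_{k=1}^{K_0} \exp(a_{0k}^T x + b_{0k}) \right) \\
\mbox{subject to} & \log \left( \sum_{k=1}^{K_i} \exp(a_{ik}^T x + b_{ik}) \right) \leq 0, \quad i = 1, \ldots, m.
\end{array}
$$

It is not clear whether or not the function

$$
t f_0(x) + \phi(x) = t \log \left( \sum_{k=1}^{K_0} \exp(a_{0k}^T x + b_{0k}) \right)
- \sum_{i=1}^m \log \left( -\log \sum_{k=1}^{K_i} \exp(a_{ik}^T x + b_{ik}) \right)
$$

is self-concordant, so although the barrier method works, the complexity analysis
of this section need not hold.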

We can, however, reformulate the GP in a form that definitely satisfies the
self-concordance assumption. For each (monomial) term $\exp(a_{ik}^T x + b_{ik})$ we introduce
a new variable $y_{ik}$ that serves as an upper bound,

$$
\exp(a_{ik}^T x + b_{ik}) \leq y_{ik}.
$$

Using these new variables we can express the GP in the form

$$
\begin{array}{ll}
\mbox{minimize} & \sum_{k=1}^{K_0} y_{0k} \\
\mbox{subject to} & \sum_{k=1}^{K_i} y_{ik} \leq 1, \quad i = 1, \ldots, m \\
& a_{ik}^T x + b_{ik} - \log y_{ik} \leq 0, \quad i = 0, \ldots, m, \quad k = 1, \ldots, K_i \\
& y_{ik} \geq 0, \quad i = 0, \ldots, m, \quad k = 1, \ldots, K_i.
\end{array}
$$

The associated logarithmic barrier is

$$
\sum_{i=0}^m \sum_{k=1}^{K_i} \left( -\log y_{ik} - \log(\log y_{ik} - a_{ik}^T x - b_{ik}) \right)
- \sum_{i=1}^m \log \left( 1 - \sum_{k=1}^{K_i} y_{ik} \right),
$$

which is closed and self-concordant (example 9.8, page 500). Since the objective is
linear, it follows that $t f_0 + \phi$ is closed and self-concordant for any $t$.


11.5.2  Newton  iterations  per  centering  step 

The  complexity  theory  of  Newton’s  method  for  self-concordant  functions,  developed 
in  §9.6.4  (page  503)  and  §10.2.4  (page  531),  shows  that  the  number  of  Newton 
iterations required to minimize a closed strictly convex self-concordant function $f$
is bounded above by

$$
\frac{f(x) - p^\star}{\gamma} + c. \qquad (11.24)
$$

Here $x$ is the starting point for Newton's method, and $p^\star = \inf_x f(x)$ is the optimal
value. The constant $\gamma$ depends only on the backtracking parameters $\alpha$ and $\beta$, and
is given by

$$
\frac{1}{\gamma} = \frac{20 - 8\alpha}{\alpha \beta (1 - 2\alpha)^2}.
$$

The constant $c$ depends only on the tolerance $\epsilon_{\mathrm{nt}}$,

$$
c = \log_2 \log_2 (1/\epsilon_{\mathrm{nt}}),
$$

and can reasonably be approximated as $c = 6$. The expression (11.24) is a quite
conservative  bound  on  the  number  of  Newton  steps  required,  but  our  interest  in  this 
section  is  only  to  establish  a  complexity  bound,  concentrating  on  how  it  increases 
with  problem  size  and  algorithm  parameters. 
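
To get a feel for the size of these constants, they can be evaluated directly. In the sketch below the function name and the sample values $\alpha = 0.1$, $\beta = 0.8$, $\epsilon_{\mathrm{nt}} = 2^{-64}$ are our own illustrative choices; they are not prescribed by the text.

```python
import math

def newton_bound_constants(alpha, beta, eps_nt):
    """Return (1/gamma, c) appearing in the Newton step bound (f(x) - p*)/gamma + c."""
    inv_gamma = (20.0 - 8.0 * alpha) / (alpha * beta * (1.0 - 2.0 * alpha) ** 2)
    c = math.log2(math.log2(1.0 / eps_nt))
    return inv_gamma, c

# Hypothetical backtracking parameters and Newton tolerance.
inv_gamma, c = newton_bound_constants(alpha=0.1, beta=0.8, eps_nt=2.0 ** -64)
```

With these inputs $1/\gamma$ evaluates to approximately 375 and $c$ to 6, so each unit of suboptimality $f(x) - p^\star$ contributes several hundred Newton steps to the bound, which shows how conservative it is.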

In this section we use this result to derive a bound on the number of Newton
steps required for one outer iteration of the barrier method, i.e., for computing
$x^\star(\mu t)$, starting from $x^\star(t)$. To lighten the notation we use $x$ to denote $x^\star(t)$, the
current iterate, and we use $x^+$ to denote $x^\star(\mu t)$, the next iterate. We use $\lambda$ and $\nu$
to denote $\lambda^\star(t)$ and $\nu^\star(t)$, respectively.

The self-concordance assumption implies that

$$
\frac{\mu t f_0(x) + \phi(x) - \mu t f_0(x^+) - \phi(x^+)}{\gamma} + c \qquad (11.25)
$$

is an upper bound on the number of Newton steps required to compute $x^+ = x^\star(\mu t)$,
starting at $x = x^\star(t)$. Unfortunately we do not know $x^+$, and hence the upper
bound (11.25), until we actually compute $x^+$, i.e., carry out the Newton algorithm
(whereupon we know the exact number of Newton steps required to compute $x^\star(\mu t)$,
which defeats the purpose). We can, however, derive an upper bound on (11.25),
as follows:


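$$
\begin{aligned}
& \mu t f_0(x) + \phi(x) - \mu t f_0(x^+) - \phi(x^+) \\
&\quad = \mu t f_0(x) - \mu t f_0(x^+) + \sum_{i=1}^m \log(-\mu t \lambda_i f_i(x^+)) - m \log \mu \\
&\quad \leq \mu t f_0(x) - \mu t f_0(x^+) - \mu t \sum_{i=1}^m \lambda_i f_i(x^+) - m - m \log \mu \\
&\quad = \mu t f_0(x) - \mu t \left( f_0(x^+) + \sum_{i=1}^m \lambda_i f_i(x^+) + \nu^T (Ax^+ - b) \right) - m - m \log \mu \\
&\quad \leq \mu t f_0(x) - \mu t g(\lambda, \nu) - m - m \log \mu \\
&\quad = m(\mu - 1 - \log \mu).
\end{aligned}
$$
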

This chain of equalities and inequalities needs some explanation. To obtain the
second line from the first, we use $\lambda_i = -1/(t f_i(x))$. In the first inequality we use
the fact that $\log a \leq a - 1$ for $a > 0$. To obtain the fourth line from the third, we
use $Ax^+ = b$, so the extra term $\nu^T (Ax^+ - b)$ is zero. The second inequality follows
from the definition of the dual function:

$$
\begin{aligned}
g(\lambda, \nu) &= \inf_z \left( f_0(z) + \sum_{i=1}^m \lambda_i f_i(z) + \nu^T (Az - b) \right) \\
&\leq f_0(x^+) + \sum_{i=1}^m \lambda_i f_i(x^+) + \nu^T (Ax^+ - b).
\end{aligned}
$$

The last line follows from $g(\lambda, \nu) = f_0(x) - m/t$.

The conclusion is that

$$
\frac{m(\mu - 1 - \log \mu)}{\gamma} + c \qquad (11.26)
$$

is an upper bound on (11.25), and therefore an upper bound on the number of
Newton steps required for one outer iteration of the barrier method. The function
$\mu - 1 - \log \mu$ is shown in figure 11.13. For $\mu$ near one it is approximately quadratic
in $\mu - 1$; for large $\mu$ it grows approximately linearly. This fits with our intuition that for $\mu$
near one, the number of Newton steps required to center is small, whereas for large
$\mu$, it could well grow.

Figure 11.13 The function $\mu - 1 - \log \mu$, versus $\mu$. The number of Newton
steps required for one outer iteration of the barrier method is bounded by
$(m/\gamma)(\mu - 1 - \log \mu) + c$.

The bound (11.26) shows that the number of Newton steps required in each
centering step is bounded by a quantity that depends mostly on $\mu$, the factor by
which $t$ is updated in each outer step of the barrier method, and $m$, the number of
inequality constraints in the problem. It also depends, weakly, on the parameters
$\alpha$ and $\beta$ used in the line search for the inner iterations, and in a very weak way
on the tolerance used to terminate the inner iterations. It is interesting to note
that the bound does not depend on $n$, the dimension of the variable, or $p$, the
number of equality constraints, or the particular values of the problem data, i.e.,
the objective and constraint functions (provided the self-concordance assumption
in §11.5.1 holds). Finally, we note that it does not depend on $t$; in particular, as
$t \to \infty$, a uniform bound on the number of Newton steps per outer iteration holds.
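
The behavior of the per-iteration bound $m(\mu - 1 - \log \mu)/\gamma + c$ is easy to tabulate. The following sketch uses hypothetical values $m = 100$, $\gamma = 1/375$, $c = 6$; the function names are ours. Note that $t$ does not appear among the arguments.

```python
import math

def mu_term(mu):
    """The function mu - 1 - log(mu) that drives the per-iteration bound."""
    return mu - 1.0 - math.log(mu)

def centering_step_bound(m, mu, gamma, c):
    """Upper bound on Newton steps for one outer iteration of the barrier method."""
    return m * mu_term(mu) / gamma + c

# Hypothetical parameter values for illustration.
m, gamma, c = 100, 1.0 / 375.0, 6.0
bounds = {mu: centering_step_bound(m, mu, gamma, c) for mu in (1.01, 1.1, 2.0, 10.0, 100.0)}

# Near mu = 1 the term is close to the quadratic (mu - 1)^2 / 2.
quadratic_gap = abs(mu_term(1.01) - 0.5 * 0.01 ** 2)
```

For $\mu$ close to one the bound is dominated by the constant $c$; for large $\mu$ it is dominated by the term $m\mu/\gamma$.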
11.5.3 Total number of Newton iterations

We  can  now  give  an  upper  bound  on  the  total  number  of  Newton  steps  in  the  barrier 
method,  not  counting  the  initial  centering  step  (which  we  will  analyze  later,  as  part 
of  phase  I).  We  multiply  (11.26),  which  bounds  the  number  of  Newton  steps  per 
outer  iteration,  by  (11.13),  the  number  of  outer  steps  required,  to  obtain 

$$
N = \left\lceil \frac{\log(m/(t^{(0)} \epsilon))}{\log \mu} \right\rceil
\left( \frac{m(\mu - 1 - \log \mu)}{\gamma} + c \right), \qquad (11.27)
$$

an upper bound on the total number of Newton steps required. This formula
shows that when the self-concordance assumption holds, we can bound the number
of Newton steps required by the barrier method, for any value of $\mu > 1$.

If we fix $\mu$ and $m$, the bound $N$ is proportional to $\log(m/(t^{(0)} \epsilon))$, which is the
log of the ratio of the initial duality gap $m/t^{(0)}$ to the final duality gap $\epsilon$, i.e., the
log of the required duality gap reduction. We can therefore say that the barrier
method converges at least linearly, since the number of steps required to reach a
given precision grows logarithmically with the inverse of the precision.

If $\mu$, and the required duality gap reduction factor, are fixed, the bound $N$ grows
linearly with $m$, the number of inequalities. The bound $N$ is independent of the
other problem dimensions $n$ and $p$, and the particular problem data or functions.
We will see below that by a particular choice of $\mu$, that depends on $m$, we can
obtain a bound on the number of Newton steps that grows only as $\sqrt{m}$, instead of
as $m$.

Finally, we analyze the bound $N$ as a function of the algorithm parameter
$\mu$. As $\mu$ approaches one, the first term in $N$ grows large, and therefore so does
$N$. This is consistent with our intuition and observation that for $\mu$ near one, the
number of outer iterations is very large. As $\mu$ becomes large, the bound $N$ grows
approximately as $\mu / \log \mu$, this time because the bound on the number of Newton
iterations required per outer iteration grows. This, too, is consistent with our
observations. As a result, the bound $N$ has a minimum value as a function of $\mu$.

The variation of the bound with the parameter $\mu$ is illustrated in figure 11.14,
which shows the bound (11.27) versus $\mu$ for the values

$$
c = 6, \qquad \gamma = 1/375, \qquad m/(t^{(0)} \epsilon) = 10^5, \qquad m = 100.
$$

The bound is qualitatively consistent with intuition, and our observations: it grows
very large as $\mu$ approaches one, and increases, more slowly, as $\mu$ becomes large. The
bound $N$ has a minimum at $\mu \approx 1.02$, which gives a bound on the total number
of Newton iterations around 8000. The complexity analysis of Newton's method is
conservative, but the basic trade-off in the choice of $\mu$ is reflected in the plot. (In
practice, far larger values of $\mu$, from around 2 to 100, work very well, and require
a total number of Newton iterations on the order of a few tens.)
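
The trade-off in $\mu$ can be reproduced by evaluating the bound $\lceil \log(m/(t^{(0)}\epsilon)) / \log \mu \rceil \, (m(\mu - 1 - \log \mu)/\gamma + c)$ on a grid. This is a minimal sketch using the parameter values quoted above; the function name, the argument `gap_ratio` (standing for $m/(t^{(0)}\epsilon)$), and the grid spacing are our own choices.

```python
import math

def total_newton_bound(mu, m=100, gamma=1.0 / 375.0, c=6.0, gap_ratio=1e5):
    """Upper bound on the total number of Newton steps of the barrier method."""
    outer_iterations = math.ceil(math.log(gap_ratio) / math.log(mu))
    steps_per_outer = m * (mu - 1.0 - math.log(mu)) / gamma + c
    return outer_iterations * steps_per_outer

# Scan mu over (1, 2] in steps of 0.001 and pick the value with the smallest bound.
grid = [1.0 + k / 1000.0 for k in range(1, 1001)]
best_mu = min(grid, key=total_newton_bound)
best_bound = total_newton_bound(best_mu)
```

On this grid the minimizer lies close to the value $\mu \approx 1.02$ quoted in the text. Because of the ceiling, the bound is not smooth in $\mu$, so the exact minimizer depends on the grid.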

Figure 11.14 The upper bound $N$ on the total number of Newton iterations,
given by equation (11.27), for $c = 6$, $\gamma = 1/375$, $m = 100$, and a duality gap
reduction factor $m/(t^{(0)} \epsilon) = 10^5$, versus the barrier algorithm parameter $\mu$.

Choosing $\mu$ as a function of $m$

When $\mu$ (and the required duality gap reduction) is fixed, the bound (11.27) grows
linearly with $m$, the number of inequalities. It turns out we can obtain a better
exponent for $m$ by making $\mu$ a function of $m$. Suppose we choose

$$
\mu = 1 + 1/\sqrt{m}. \qquad (11.28)
$$

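Then we can bound the second term in (11.27) as

$$
\begin{aligned}
\mu - 1 - \log \mu &= 1/\sqrt{m} - \log(1 + 1/\sqrt{m}) \\
&\leq 1/\sqrt{m} - 1/\sqrt{m} + 1/(2m) \\
&= 1/(2m)
\end{aligned}
$$

(using $-\log(1 + a) \leq -a + a^2/2$ for $a \geq 0$). Using concavity of the logarithm, we
also have

$$
\log \mu = \log(1 + 1/\sqrt{m}) \geq (\log 2)/\sqrt{m}.
$$

Using these inequalities we can bound the total number of Newton steps by

$$
\begin{aligned}
N &\leq \left\lceil \frac{\log(m/(t^{(0)} \epsilon))}{\log \mu} \right\rceil
\left( \frac{m(\mu - 1 - \log \mu)}{\gamma} + c \right) \\
&\leq \left\lceil \sqrt{m} \, \frac{\log(m/(t^{(0)} \epsilon))}{\log 2} \right\rceil
\left( \frac{1}{2\gamma} + c \right) \\
&= \left\lceil \sqrt{m} \log_2(m/(t^{(0)} \epsilon)) \right\rceil \left( \frac{1}{2\gamma} + c \right) \\
&\leq c_1 + c_2 \sqrt{m},
\end{aligned}
\qquad (11.29)
$$

where

$$
c_1 = \frac{1}{2\gamma} + c, \qquad
c_2 = \log_2(m/(t^{(0)} \epsilon)) \left( \frac{1}{2\gamma} + c \right).
$$

Here $c_1$ depends (and only weakly) on algorithm parameters for the centering
Newton steps, and $c_2$ depends on these and the required duality gap reduction.
Note that the term $\log_2(m/(t^{(0)} \epsilon))$ is exactly the number of bits of required duality
gap reduction.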
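
The two inequalities used for the choice $\mu = 1 + 1/\sqrt{m}$, and the resulting bound of the form $c_1 + c_2\sqrt{m}$, can be spot-checked numerically. The sketch below holds the duality gap reduction factor $m/(t^{(0)}\epsilon)$ fixed at a hypothetical value `gap_ratio`; the function name and the sample values of $\gamma$ and $c$ are also our own.

```python
import math

def sqrt_m_bound_check(m, gamma=1.0 / 375.0, c=6.0, gap_ratio=1e5):
    """Evaluate the Newton step bound and its estimates for mu = 1 + 1/sqrt(m)."""
    mu = 1.0 + 1.0 / math.sqrt(m)
    second_term = mu - 1.0 - math.log(mu)
    n_bound = math.ceil(math.log(gap_ratio) / math.log(mu)) * (m * second_term / gamma + c)
    c1 = 1.0 / (2.0 * gamma) + c
    c2 = math.log2(gap_ratio) * c1
    return {
        "second_term_ok": second_term <= 1.0 / (2.0 * m),
        "log_mu_ok": math.log(mu) >= math.log(2.0) / math.sqrt(m),
        "N": n_bound,
        "c1_plus_c2_sqrt_m": c1 + c2 * math.sqrt(m),
    }

# Hypothetical numbers of inequalities.
checks = {m: sqrt_m_bound_check(m) for m in (4, 100, 10_000)}
```

For each value of $m$ tried here both inequalities hold and $N$ stays below $c_1 + c_2\sqrt{m}$, as the derivation requires.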

For fixed duality gap reduction, the bound (11.29) grows as $\sqrt{m}$, whereas the
bound $N$ in (11.27) grows like $m$, if the parameter $\mu$ is held constant. For this
reason the barrier method, with parameter value (11.28), is said to be an order
$\sqrt{m}$ method.

In practice, we would not use the value $\mu = 1 + 1/\sqrt{m}$, which is far too small,
or even decrease $\mu$ as a function of $m$. Our only interest in this value of $\mu$ is that
it (approximately) minimizes our (very conservative) upper bound on the number
of Newton steps, and yields an overall estimate that grows as $\sqrt{m}$, instead of $m$.


11.5.4  Feasibility  problems 

In  this  section  we  analyze  the  complexity  of  a  (minor)  variation  on  the  basic  phase  I 
method  described  in  §11.4.1,  used  to  solve  a  set  of  convex  inequalities, 

$$
f_1(x) \leq 0, \quad \ldots, \quad f_m(x) \leq 0, \qquad (11.30)
$$

where $f_1, \ldots, f_m$ are convex, with continuous second derivatives. (We will consider
equality constraints later.) We assume that the phase I problem

$$
\begin{array}{ll}
\mbox{minimize} & s \\
\mbox{subject to} & f_i(x) \leq s, \quad i = 1, \ldots, m
\end{array}
\qquad (11.31)
$$

satisfies the conditions in §11.5.1. In particular we assume that the feasible set of
the inequalities (11.30) (which of course can be empty) is contained in a Euclidean
ball of radius $R$:

$$
\{x \mid f_i(x) \leq 0, \; i = 1, \ldots, m\} \subseteq \{x \mid \|x\|_2 \leq R\}.
$$

We can interpret $R$ as a prior bound on the norm of any point in the feasible set of
the inequalities. This assumption implies that the sublevel sets of the phase I problem
are bounded. Without loss of generality, we will start the phase I method at the
point $x = 0$. We define $F = \max_i f_i(0)$, which is the maximum constraint violation,
assumed to be positive (since otherwise $x = 0$ satisfies the inequalities (11.30)).

We define $p^\star$ as the optimal value of the phase I optimization problem (11.31).
The sign of $p^\star$ determines whether or not the set of inequalities (11.30) is feasible.
The magnitude of $p^\star$ also has a meaning. If $p^\star$ is positive and large (say, near $F$,
the largest value it can have) it means that the set of inequalities is quite infeasible,
in the sense that for each $x$, at least one of the inequalities is substantially violated
(by at least $p^\star$). On the other hand, if $p^\star$ is negative and large, it means that
the set of inequalities is quite feasible, in the sense that there is not only an $x$ for
which $f_i(x)$ are all nonpositive, but in fact there is an $x$ for which $f_i(x)$ are all quite
negative (no more than $p^\star$). Thus, the magnitude $|p^\star|$ is a measure of how clearly
the set of inequalities is feasible or infeasible, and therefore related to the difficulty
of determining feasibility of the inequalities (11.30). In particular, if $|p^\star|$ is small,
it means the problem is near the boundary between feasibility and infeasibility.

To determine feasibility of the inequalities, we use a variation on the basic
phase I problem (11.31). We add a redundant linear inequality $a^T x \leq 1$, to obtain

$$
\begin{array}{ll}
\mbox{minimize} & s \\
\mbox{subject to} & f_i(x) \leq s, \quad i = 1, \ldots, m \\
& a^T x \leq 1.
\end{array}
\qquad (11.32)
$$

We will specify $a$ later. Our choice will satisfy $\|a\|_2 \leq 1/R$, so $\|x\|_2 \leq R$ implies
$a^T x \leq 1$, i.e., the extra constraint is redundant.

We will choose $a$ and $s_0$ so that $x = 0$, $s = s_0$ is on the central path of the
problem (11.32), with a parameter value $t^{(0)}$, i.e., they minimize

$$
t^{(0)} s - \sum_{i=1}^m \log(s - f_i(x)) - \log(1 - a^T x).
$$

Setting to zero the derivative with respect to $s$, we get

$$
t^{(0)} = \sum_{i=1}^m \frac{1}{s_0 - f_i(0)}. \qquad (11.33)
$$

Setting to zero the gradient with respect to $x$ yields

$$
a = -\sum_{i=1}^m \frac{1}{s_0 - f_i(0)} \nabla f_i(0). \qquad (11.34)
$$

So it remains only to pick the parameter $s_0$; once we have chosen $s_0$, the vector $a$
is given by (11.34), and the parameter $t^{(0)}$ is given by (11.33). Since $x = 0$ and
$s = s_0$ must be strictly feasible for the phase I problem (11.32), we must choose
$s_0 > F$.

We must also pick $s_0$ to make sure that $\|a\|_2 \leq 1/R$. From (11.34), we have

$$
\|a\|_2 \leq \sum_{i=1}^m \frac{1}{s_0 - f_i(0)} \|\nabla f_i(0)\|_2 \leq \frac{mG}{s_0 - F},
$$

where $G = \max_i \|\nabla f_i(0)\|_2$. Therefore we can take $s_0 = mGR + F$, which ensures
$\|a\|_2 \leq 1/R$, so the extra linear inequality is redundant.

Using (11.33), we have

$$
t^{(0)} = \sum_{i=1}^m \frac{1}{mGR + F - f_i(0)} \geq \frac{1}{mGR},
$$

since $F = \max_i f_i(0)$. Thus $x = 0$, $s = s_0$ are on the central path for the phase I
problem (11.32), with initial duality gap

$$
\frac{m + 1}{t^{(0)}} \leq (m + 1) m G R.
$$
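
The construction of the central starting point can be carried out directly from the constraint values and gradients at the origin. The sketch below follows the recipe $s_0 = mGR + F$, with $t^{(0)}$ and $a$ formed from the weights $1/(s_0 - f_i(0))$; the function name and the three sample constraints in $\mathbf{R}^2$ are hypothetical.

```python
import math

def phase1_central_start(f_at_0, grads_at_0, R):
    """Return (s0, t0, a) so that x = 0, s = s0 is central for the modified phase I problem."""
    m = len(f_at_0)
    F = max(f_at_0)                                   # maximum constraint violation at x = 0
    G = max(math.sqrt(sum(g * g for g in grad)) for grad in grads_at_0)
    s0 = m * G * R + F
    weights = [1.0 / (s0 - fi) for fi in f_at_0]
    t0 = sum(weights)                                 # stationarity in s
    n = len(grads_at_0[0])
    # stationarity in x at x = 0: a = -sum_i weights_i * grad f_i(0)
    a = [-sum(w * grad[j] for w, grad in zip(weights, grads_at_0)) for j in range(n)]
    return s0, t0, a

# Hypothetical data: values f_i(0) and gradients of three constraints, radius R = 10.
f_at_0 = [1.0, -2.0, 0.5]
grads_at_0 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
R = 10.0
s0, t0, a = phase1_central_start(f_at_0, grads_at_0, R)

m = len(f_at_0)
G = math.sqrt(2.0)                                    # largest gradient norm in this data
a_norm = math.sqrt(sum(ai * ai for ai in a))
initial_gap = (m + 1) / t0
```

For this data the computed quantities satisfy $\|a\|_2 \leq 1/R$, $t^{(0)} \geq 1/(mGR)$, and initial duality gap at most $(m+1)mGR$.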
To solve the original inequalities (11.30) we need to determine the sign of $p^\star$.
We can stop when either the primal objective value of (11.32) is negative, or the
dual objective value is positive. One of these two cases must occur when the duality
gap for (11.32) is less than $|p^\star|$.

We use the barrier method to solve (11.32), starting from a central point with
duality gap no more than $(m + 1)mGR$, and terminating when (or before) the
duality gap is less than $|p^\star|$. Using the results of the previous section, this requires
no more than


$$
\left\lceil \sqrt{m + 1} \, \log_2 \frac{m(m + 1)GR}{|p^\star|} \right\rceil
\left( \frac{1}{2\gamma} + c \right) \qquad (11.35)
$$

Newton steps. (Here we take $\mu = 1 + 1/\sqrt{m + 1}$, which gives a better complexity
exponent for $m$ than a fixed value of $\mu$.)

The bound (11.35) grows only slightly faster than $\sqrt{m}$, and depends weakly on
the algorithm parameters used in the centering steps. It is approximately proportional
to $\log_2((GR)/|p^\star|)$, which can be interpreted as a measure of how difficult
the particular feasibility problem is, or how close it is to the boundary between
feasibility and infeasibility.


Feasibility  problems  with  equality  constraints 

We can apply the same analysis to feasibility problems that include equality
constraints, by eliminating the equality constraints. This does not affect the
self-concordance of the problem, but it does mean that $G$ and $R$ refer to the reduced,
or eliminated, problem.


11.5.5  Combined  phase  I/phase  II  complexity 

In this section we give an end-to-end complexity analysis for solving the problem

$$
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& Ax = b
\end{array}
$$

using (a variation on) the barrier method. First we solve the phase I problem

$$
\begin{array}{ll}
\mbox{minimize} & s \\
\mbox{subject to} & f_i(x) \leq s, \quad i = 1, \ldots, m \\
& f_0(x) \leq M \\
& Ax = b \\
& a^T x \leq 1,
\end{array}
$$

which we assume satisfies the self-concordance and bounded sublevel set assumptions
of §11.5.1. Here we have added two redundant inequalities to the basic phase I
problem. The constraint $f_0(x) \leq M$ is added to guarantee that the phase I central
path intersects the central path for phase II, as described in §11.4.1
(see (11.21)). The number $M$ is a prior bound on the optimal value of the problem.
The second added constraint is the linear inequality $a^T x \leq 1$, where $a$ is chosen
as described in §11.5.4. We use the barrier method to solve this problem, with
$\mu = 1 + 1/\sqrt{m + 2}$, and the starting points $x = 0$, $s = s_0$ given in §11.5.4.

To  either  find  a  strictly  feasible  point,  or  determine  the  problem  is  infeasible, 
requires  no  more  than 


Ni 


y/m  +  2  log2 


(m  +  1  )(m  +  2  )GR 

W\ 


(11.36) 


Newton steps, where $G$ and $R$ are as given in §11.5.4. If the problem is infeasible
we are done; if it is feasible, then we find a point in phase I, associated with $s = 0$,
that lies on the central path of the phase II problem
\[
\begin{array}{ll}
\mbox{minimize}   & f_0(x) \\
\mbox{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
                  & Ax = b \\
                  & a^T x \le 1.
\end{array}
\]


The duality gap associated with this initial point is no more than $(m+1)(M - p^*)$
(see (11.22)). We assume the phase II problem also satisfies the self-concordance
and bounded sublevel set assumptions in §11.5.1.

We now proceed to phase II, again using the barrier method. We must reduce
the duality gap from its initial value, which is no more than $(m+1)(M - p^*)$, to
some tolerance $\epsilon > 0$. This takes at most
\[
N_{II} = \left\lceil \sqrt{m+1}\, \log_2 \frac{(m+1)(M - p^*)}{\epsilon} \right\rceil
         \left( \frac{1}{2\gamma} + c \right)
\tag{11.37}
\]
Newton steps.

The total number of Newton steps is therefore no more than $N_I + N_{II}$. This
bound grows with the number of inequalities $m$ approximately as $\sqrt{m}$, and includes
two terms that depend on the particular problem instance,
\[
\log_2 \frac{GR}{|\bar p^*|}, \qquad \log_2 \frac{M - p^*}{\epsilon}.
\]
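As a rough illustration of how the bounds (11.36) and (11.37) behave, the sketch below evaluates them numerically. The function name, the argument names, and the sample values are hypothetical; the bound on Newton steps per centering step from §11.5.3 is not recomputed here but passed in as `steps_per_centering`.

```python
import math

def phase_bounds(m, G, R, p_bar_abs, M, p_star, eps, steps_per_centering):
    """Evaluate the phase I and phase II Newton-step bounds.

    p_bar_abs is |p-bar*|, the magnitude of the phase I optimal value.
    steps_per_centering is the per-centering bound from Section 11.5.3,
    supplied by the caller (an assumption of this sketch).
    """
    # Number of centering steps in phase I, with mu = 1 + 1/sqrt(m + 2).
    outer_1 = math.ceil(math.sqrt(m + 2)
                        * math.log2((m + 1) * (m + 2) * G * R / p_bar_abs))
    # Number of centering steps in phase II, with mu = 1 + 1/sqrt(m + 1).
    outer_2 = math.ceil(math.sqrt(m + 1)
                        * math.log2((m + 1) * (M - p_star) / eps))
    return outer_1 * steps_per_centering, outer_2 * steps_per_centering

# Hypothetical instance data, chosen only to exercise the formulas.
n_phase1, n_phase2 = phase_bounds(m=14, G=1.0, R=1.0, p_bar_abs=240 / 1024,
                                  M=1.0, p_star=0.0, eps=15 / 2**20,
                                  steps_per_centering=1)
total = n_phase1 + n_phase2
```

Doubling `m` changes the leading factors by roughly $\sqrt{2}$, while the instance-dependent data enters only through the two logarithms.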


11.5.6  Summary 

The complexity analysis given in this section is mostly of theoretical interest. In
particular, we remind the reader that the choice $\mu = 1 + 1/\sqrt{m}$, discussed in this
section, would be a very poor one to use in practice; its only advantage is that it
results in a bound that grows like $\sqrt{m}$ instead of $m$. Likewise, we do not recommend
adding the redundant inequality $a^T x \le 1$ in practice.

The actual bounds obtained from the analysis given here are far higher than the
numbers of iterations actually observed. Even the order in the bound appears to
be conservative. The best bounds on the number of Newton steps grow like $\sqrt{m}$,
whereas practical experience suggests that the number of Newton steps hardly
grows at all with $m$ (or any other parameter, in fact).

Still, it is comforting to know that when the self-concordance condition holds,
we can give a uniform bound on the number of Newton steps required in each
centering step of the barrier method. An obvious potential pitfall of the barrier
method is the possibility that as $t$ grows, the associated centering problems might
become more difficult, requiring more Newton steps. While practical experience
suggests that this is not the case, the uniform bound bolsters our confidence that
it cannot happen.

Finally,  we  mention  that  it  is  not  yet  clear  whether  or  not  there  is  a  practical 
advantage  to  formulating  a  problem  so  that  the  self-concordance  condition  holds. 
All  we  can  say  is  that  when  the  self-concordance  condition  holds,  the  barrier  method 
will  work  well  in  practice,  and  we  can  give  a  worst  case  complexity  bound. 


11.6  Problems  with  generalized  inequalities 


In this section we show how the barrier method can be extended to problems with
generalized inequalities. We consider the problem
\[
\begin{array}{ll}
\mbox{minimize}   & f_0(x) \\
\mbox{subject to} & f_i(x) \preceq_{K_i} 0, \quad i = 1, \ldots, m \\
                  & Ax = b,
\end{array}
\tag{11.38}
\]
where $f_0 : \mathbf{R}^n \to \mathbf{R}$ is convex, $f_i : \mathbf{R}^n \to \mathbf{R}^{k_i}$, $i = 1, \ldots, m$, are $K_i$-convex, and
$K_i \subseteq \mathbf{R}^{k_i}$ are proper cones. As in §11.1, we assume that the functions $f_i$ are twice
continuously differentiable, that $A \in \mathbf{R}^{p \times n}$ with $\mathop{\bf rank} A = p$, and that the problem
is solvable.

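The KKT conditions for problem (11.38) are
\[
\begin{array}{rcl}
Ax^* & = & b \\
f_i(x^*) & \preceq_{K_i} & 0, \quad i = 1, \ldots, m \\
\lambda_i^* & \succeq_{K_i^*} & 0, \quad i = 1, \ldots, m \\
\nabla f_0(x^*) + \sum_{i=1}^m Df_i(x^*)^T \lambda_i^* + A^T \nu^* & = & 0 \\
\lambda_i^{*T} f_i(x^*) & = & 0, \quad i = 1, \ldots, m,
\end{array}
\tag{11.39}
\]
where $Df_i(x^*) \in \mathbf{R}^{k_i \times n}$ is the derivative of $f_i$ at $x^*$. We will assume that
problem (11.38) is strictly feasible, so the KKT conditions are necessary and sufficient
conditions for optimality of $x^*$.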

The  development  of  the  method  is  parallel  to  the  case  with  scalar  constraints. 
Once  we  develop  a  generalization  of  the  logarithm  function  that  applies  to  general 
proper  cones,  we  can  define  a  logarithmic  barrier  function  for  the  problem  (11.38). 
From  that  point  on,  the  development  is  essentially  the  same  as  in  the  scalar  case. 
In  particular,  the  central  path,  barrier  method,  and  complexity  analysis  are  very 
similar. 

11.6.1 Logarithmic barrier and central path

Generalized  logarithm  for  a  proper  cone 

We first define the analog of the logarithm, $\log x$, for a proper cone $K \subseteq \mathbf{R}^q$. We
say that $\psi : \mathbf{R}^q \to \mathbf{R}$ is a generalized logarithm for $K$ if

• $\psi$ is concave, closed, twice continuously differentiable, $\mathop{\bf dom} \psi = \mathop{\bf int} K$, and
$\nabla^2 \psi(y) \prec 0$ for $y \in \mathop{\bf int} K$.

• There is a constant $\theta > 0$ such that for all $y \succ_K 0$, and all $s > 0$,
\[
\psi(sy) = \psi(y) + \theta \log s.
\]
In other words, $\psi$ behaves like a logarithm along any ray in the cone $K$.

We call the constant $\theta$ the degree of $\psi$ (since $\exp \psi$ is a homogeneous function of
degree $\theta$). Note that a generalized logarithm is only defined up to an additive
constant; if $\psi$ is a generalized logarithm for $K$, then so is $\psi + a$, where $a \in \mathbf{R}$. The
ordinary logarithm is, of course, a generalized logarithm for $\mathbf{R}_+$.

We will use the following two properties, which are satisfied by any generalized
logarithm: If $y \succ_K 0$, then
\[
\nabla \psi(y) \succ_{K^*} 0,
\tag{11.40}
\]
which implies $\psi$ is $K$-increasing (see §3.6.1), and
\[
y^T \nabla \psi(y) = \theta.
\]
The first property is proved in exercise 11.15. The second property follows
immediately from differentiating $\psi(sy) = \psi(y) + \theta \log s$ with respect to $s$.


Example 11.5 Nonnegative orthant. The function $\psi(x) = \sum_{i=1}^n \log x_i$ is a generalized
logarithm for $K = \mathbf{R}^n_+$, with degree $n$. For $x \succ 0$,
\[
\nabla \psi(x) = (1/x_1, \ldots, 1/x_n),
\]
so $\nabla \psi(x) \succ 0$, and $x^T \nabla \psi(x) = n$.


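Example 11.6 Second-order cone. The function
\[
\psi(x) = \log \left( x_{n+1}^2 - \sum_{i=1}^n x_i^2 \right)
\]
is a generalized logarithm for the second-order cone
\[
K = \left\{ x \in \mathbf{R}^{n+1} \;\middle|\; \left( \sum_{i=1}^n x_i^2 \right)^{1/2} \le x_{n+1} \right\},
\]
with degree 2. The gradient of $\psi$ at a point $x \in \mathop{\bf int} K$ is given by
\[
\frac{\partial \psi(x)}{\partial x_j} = \frac{-2 x_j}{x_{n+1}^2 - \sum_{i=1}^n x_i^2}, \quad j = 1, \ldots, n,
\qquad
\frac{\partial \psi(x)}{\partial x_{n+1}} = \frac{2 x_{n+1}}{x_{n+1}^2 - \sum_{i=1}^n x_i^2}.
\]
The identities $\nabla \psi(x) \in \mathop{\bf int} K^* = \mathop{\bf int} K$ and $x^T \nabla \psi(x) = 2$ are easily verified.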


Example 11.7 Positive semidefinite cone. The function $\psi(X) = \log \det X$ is a
generalized logarithm for the cone $\mathbf{S}^p_+$. The degree is $p$, since
\[
\log \det(sX) = \log \det X + p \log s
\]
for $s > 0$. The gradient of $\psi$ at a point $X \in \mathbf{S}^p_{++}$ is equal to
\[
\nabla \psi(X) = X^{-1}.
\]
Thus, we have $\nabla \psi(X) = X^{-1} \succ 0$, and the inner product of $X$ and $\nabla \psi(X)$ is equal
to $\mathop{\bf tr}(X X^{-1}) = p$.
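The two properties used above, $y^T \nabla\psi(y) = \theta$ and $\psi(sy) = \psi(y) + \theta \log s$, can be checked numerically for the three cones of examples 11.5 to 11.7. The following is a minimal sketch; the function names, the random test points, and the sizes are assumptions made for illustration only.

```python
import numpy as np

def psi_orthant(x):               # example 11.5, degree n
    return np.sum(np.log(x))

def grad_orthant(x):
    return 1.0 / x

def psi_soc(x):                   # example 11.6, degree 2
    return np.log(x[-1] ** 2 - np.sum(x[:-1] ** 2))

def grad_soc(x):
    denom = x[-1] ** 2 - np.sum(x[:-1] ** 2)
    g = -2.0 * x / denom          # entries j = 1, ..., n
    g[-1] = 2.0 * x[-1] / denom   # last entry has the opposite sign
    return g

def psi_psd(X):                   # example 11.7, degree p
    return np.linalg.slogdet(X)[1]

def grad_psd(X):
    return np.linalg.inv(X)

rng = np.random.default_rng(0)
x_orth = rng.uniform(0.5, 2.0, size=4)                 # interior of R^4_+
x_soc = np.append(rng.uniform(-1.0, 1.0, size=3), 3.0)  # norm of first part < 3
B = rng.standard_normal((3, 3))
X_psd = B @ B.T + np.eye(3)                            # positive definite
s = 2.5

# y^T grad psi(y) should equal the degree of each generalized logarithm.
deg_orth = x_orth @ grad_orthant(x_orth)
deg_soc = x_soc @ grad_soc(x_soc)
deg_psd = np.trace(X_psd @ grad_psd(X_psd))

# psi(s y) - psi(y) should equal degree * log(s).
hom_orth = psi_orthant(s * x_orth) - psi_orthant(x_orth)
hom_soc = psi_soc(s * x_soc) - psi_soc(x_soc)
hom_psd = psi_psd(s * X_psd) - psi_psd(X_psd)
```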


Logarithmic  barrier  functions  for  generalized  inequalities 

Returning to problem (11.38), let $\psi_1, \ldots, \psi_m$ be generalized logarithms for the
cones $K_1, \ldots, K_m$, respectively, with degrees $\theta_1, \ldots, \theta_m$. We define the logarithmic
barrier function for problem (11.38) as
\[
\phi(x) = -\sum_{i=1}^m \psi_i(-f_i(x)), \qquad
\mathop{\bf dom} \phi = \{ x \mid f_i(x) \prec_{K_i} 0, \; i = 1, \ldots, m \}.
\]
Convexity of $\phi$ follows from the fact that the functions $\psi_i$ are $K_i$-increasing, and
the functions $f_i$ are $K_i$-convex (see the composition rule of §3.6.2).

The  central  path 

The next step is to define the central path for problem (11.38). We define the
central point $x^*(t)$, for $t \ge 0$, as the minimizer of $t f_0 + \phi$, subject to $Ax = b$, i.e.,
as the solution of
\[
\begin{array}{ll}
\mbox{minimize}   & t f_0(x) - \sum_{i=1}^m \psi_i(-f_i(x)) \\
\mbox{subject to} & Ax = b
\end{array}
\]
(assuming the minimizer exists, and is unique). Central points are characterized
by the optimality condition
\[
t \nabla f_0(x) + \nabla \phi(x) + A^T \nu
= t \nabla f_0(x) + \sum_{i=1}^m Df_i(x)^T \nabla \psi_i(-f_i(x)) + A^T \nu = 0,
\tag{11.41}
\]
for some $\nu \in \mathbf{R}^p$, where $Df_i(x)$ is the derivative of $f_i$ at $x$.

Dual points on central path

As in the scalar case, points on the central path give dual feasible points for the
problem (11.38). For $i = 1, \ldots, m$, define
\[
\lambda_i^*(t) = \frac{1}{t} \nabla \psi_i(-f_i(x^*(t))),
\tag{11.42}
\]
and let $\nu^*(t) = \nu/t$, where $\nu$ is the optimal dual variable in (11.41). We will
show that $\lambda_1^*(t), \ldots, \lambda_m^*(t)$, together with $\nu^*(t)$, are dual feasible for the original
problem (11.38).

First, $\lambda_i^*(t) \succ_{K_i^*} 0$, by the monotonicity property (11.40) of generalized
logarithms. Second, it follows from (11.41) that the Lagrangian
\[
L(x, \lambda^*(t), \nu^*(t)) = f_0(x) + \sum_{i=1}^m \lambda_i^*(t)^T f_i(x) + \nu^*(t)^T (Ax - b)
\]
is minimized over $x$ by $x = x^*(t)$. The dual function $g$ evaluated at $(\lambda^*(t), \nu^*(t))$
is therefore equal to
\[
\begin{array}{rcl}
g(\lambda^*(t), \nu^*(t))
& = & f_0(x^*(t)) + \sum_{i=1}^m \lambda_i^*(t)^T f_i(x^*(t)) + \nu^*(t)^T (Ax^*(t) - b) \\
& = & f_0(x^*(t)) + (1/t) \sum_{i=1}^m \nabla \psi_i(-f_i(x^*(t)))^T f_i(x^*(t)) \\
& = & f_0(x^*(t)) - (1/t) \sum_{i=1}^m \theta_i,
\end{array}
\]
where $\theta_i$ is the degree of $\psi_i$. In the last line, we use the fact that $y^T \nabla \psi_i(y) = \theta_i$
for $y \succ_{K_i} 0$, and therefore
\[
\lambda_i^*(t)^T f_i(x^*(t)) = -\theta_i / t, \quad i = 1, \ldots, m.
\tag{11.43}
\]
Thus, if we define
\[
\bar\theta = \sum_{i=1}^m \theta_i,
\]
then the primal feasible point $x^*(t)$ and the dual feasible point $(\lambda^*(t), \nu^*(t))$ have
duality gap $\bar\theta / t$. This is just like the scalar case, except that $\bar\theta$, the sum of the
degrees of the generalized logarithms for the cones, appears in place of $m$, the
number of inequalities.


Example 11.8 Second-order cone programming. We consider an SOCP with variable
$x \in \mathbf{R}^n$:
\[
\begin{array}{ll}
\mbox{minimize}   & f^T x \\
\mbox{subject to} & \| A_i x + b_i \|_2 \le c_i^T x + d_i, \quad i = 1, \ldots, m,
\end{array}
\tag{11.44}
\]
where $A_i \in \mathbf{R}^{n_i \times n}$. As we have seen in example 11.6, the function
\[
\psi(y) = \log \left( y_{p+1}^2 - \sum_{i=1}^p y_i^2 \right)
\]
is a generalized logarithm for the second-order cone in $\mathbf{R}^{p+1}$, with degree 2. The
corresponding logarithmic barrier function for (11.44) is
\[
\phi(x) = -\sum_{i=1}^m \log \left( (c_i^T x + d_i)^2 - \| A_i x + b_i \|_2^2 \right),
\tag{11.45}
\]
with $\mathop{\bf dom} \phi = \{ x \mid \| A_i x + b_i \|_2 < c_i^T x + d_i, \; i = 1, \ldots, m \}$. The optimality condition
on the central path is $t f + \nabla \phi(x^*(t)) = 0$, where
\[
\nabla \phi(x) = -2 \sum_{i=1}^m \frac{1}{(c_i^T x + d_i)^2 - \| A_i x + b_i \|_2^2}
\left( (c_i^T x + d_i) c_i - A_i^T (A_i x + b_i) \right).
\]
It follows that the point
\[
z_i^*(t) = -\frac{2}{t \alpha_i} (A_i x^*(t) + b_i), \qquad
w_i^*(t) = \frac{2}{t \alpha_i} (c_i^T x^*(t) + d_i), \qquad i = 1, \ldots, m,
\]
where $\alpha_i = (c_i^T x^*(t) + d_i)^2 - \| A_i x^*(t) + b_i \|_2^2$, is strictly feasible in the dual problem
\[
\begin{array}{ll}
\mbox{maximize}   & -\sum_{i=1}^m (b_i^T z_i + d_i w_i) \\
\mbox{subject to} & \sum_{i=1}^m (A_i^T z_i + c_i w_i) = f \\
                  & \| z_i \|_2 \le w_i, \quad i = 1, \ldots, m.
\end{array}
\]
The duality gap associated with $x^*(t)$ and $(z^*(t), w^*(t))$ is
\[
\sum_{i=1}^m \left( (A_i x^*(t) + b_i)^T z_i^*(t) + (c_i^T x^*(t) + d_i) w_i^*(t) \right) = \frac{2m}{t},
\]
which agrees with the general formula $\bar\theta / t$, since $\theta_i = 2$.


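Example 11.9 Semidefinite programming in inequality form. We consider the SDP
with variable $x \in \mathbf{R}^n$,
\[
\begin{array}{ll}
\mbox{minimize}   & c^T x \\
\mbox{subject to} & F(x) = x_1 F_1 + \cdots + x_n F_n + G \preceq 0,
\end{array}
\]
where $G, F_1, \ldots, F_n \in \mathbf{S}^p$. The dual problem is
\[
\begin{array}{ll}
\mbox{maximize}   & \mathop{\bf tr}(GZ) \\
\mbox{subject to} & \mathop{\bf tr}(F_i Z) + c_i = 0, \quad i = 1, \ldots, n \\
                  & Z \succeq 0.
\end{array}
\]
Using the generalized logarithm $\log \det X$ for the positive semidefinite cone $\mathbf{S}^p_+$, we
have the barrier function (for the primal problem)
\[
\phi(x) = \log \det (-F(x)^{-1})
\]
with $\mathop{\bf dom} \phi = \{ x \mid F(x) \prec 0 \}$. For strictly feasible $x$, the gradient of $\phi$ is equal to
\[
\frac{\partial \phi(x)}{\partial x_i} = \mathop{\bf tr}(-F(x)^{-1} F_i), \quad i = 1, \ldots, n,
\]
which gives us the optimality conditions that characterize central points:
\[
t c_i + \mathop{\bf tr}(-F(x^*(t))^{-1} F_i) = 0, \quad i = 1, \ldots, n.
\]
Hence the matrix
\[
Z^*(t) = \frac{1}{t} \left( -F(x^*(t)) \right)^{-1}
\]
is strictly dual feasible, and the duality gap associated with $x^*(t)$ and $Z^*(t)$ is $p/t$.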
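The gradient expression $\partial\phi/\partial x_i = \mathop{\bf tr}(-F(x)^{-1}F_i)$ can likewise be compared with finite differences. This is a minimal sketch on a random instance; the sizes, the seed, and the construction of $G$ (chosen so that a known point is strictly feasible) are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 3, 4                                # hypothetical sizes

def sym(M):
    return (M + M.T) / 2

F_mats = [sym(rng.standard_normal((p, p))) for _ in range(n)]
x0 = rng.standard_normal(n)
# Choose G so that F(x0) = -I, which makes x0 strictly feasible.
G = -np.eye(p) - sum(x0[i] * F_mats[i] for i in range(n))

def F(x):
    return sum(x[i] * F_mats[i] for i in range(n)) + G

def barrier(x):
    # phi(x) = log det(-F(x)^{-1}) = -log det(-F(x))
    sign, logabsdet = np.linalg.slogdet(-F(x))
    assert sign > 0, "x must be strictly feasible"
    return -logabsdet

def barrier_grad(x):
    F_inv = np.linalg.inv(F(x))
    return np.array([np.trace(-F_inv @ F_mats[i]) for i in range(n)])

x = x0 + 0.01 * rng.standard_normal(n)     # nearby strictly feasible point
h = 1e-6
fd_grad = np.array([(barrier(x + h * e) - barrier(x - h * e)) / (2 * h)
                    for e in np.eye(n)])
max_eig = np.max(np.linalg.eigvalsh(F(x)))  # negative when F(x) < 0
```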

11.6.2 Barrier method

We have seen that the key properties of the central path generalize to problems
with generalized inequalities.

• Computing a point on the central path involves minimizing a twice
differentiable convex function subject to equality constraints (which can be done
using Newton's method).

• With the central point $x^*(t)$ we can associate a dual feasible point $(\lambda^*(t), \nu^*(t))$
with associated duality gap $\bar\theta / t$. In particular, $x^*(t)$ is no more than
$\bar\theta / t$-suboptimal.

This means we can apply the barrier method, exactly as described in §11.3, to the
problem (11.38). The number of outer iterations, or centering steps, required to
compute a central point with duality gap $\epsilon$ starting at $x^*(t^{(0)})$ is equal to
\[
\left\lceil \frac{\log(\bar\theta / (t^{(0)} \epsilon))}{\log \mu} \right\rceil,
\]
plus one initial centering step. The only difference between this result and the
associated one for the scalar case is that $\bar\theta$ takes the place of $m$.
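The outer-iteration count is simple to evaluate. The helper below is a small sketch (the function and argument names are hypothetical); it excludes the initial centering step, as the formula does.

```python
import math

def outer_iterations(theta_bar, t0, eps, mu):
    """Centering steps needed to bring the gap theta_bar/t0 down to eps.

    theta_bar is the sum of the degrees of the generalized logarithms,
    t0 the initial value of t, and mu > 1 the update factor.
    """
    return math.ceil(math.log(theta_bar / (t0 * eps)) / math.log(mu))

# Hypothetical setting: total degree 100, initial gap 100, target gap 1e-3.
iters_mu_2 = outer_iterations(theta_bar=100, t0=1.0, eps=1e-3, mu=2)
iters_mu_50 = outer_iterations(theta_bar=100, t0=1.0, eps=1e-3, mu=50)
```

When the ratio of logarithms is an exact integer in exact arithmetic, floating-point rounding can push the ceiling up by one, so such boundary cases deserve a tolerance in practice.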

Phase  I  and  feasibility  problems 

The phase I methods described in §11.4 are readily extended to problems with
generalized inequalities. Let $e_i \succ_{K_i} 0$ be some given, $K_i$-positive vectors, for
$i = 1, \ldots, m$. To determine feasibility of the equalities and generalized inequalities
\[
f_1(x) \preceq_{K_1} 0, \quad \ldots, \quad f_m(x) \preceq_{K_m} 0, \qquad Ax = b,
\]
we solve the problem
\[
\begin{array}{ll}
\mbox{minimize}   & s \\
\mbox{subject to} & f_i(x) \preceq_{K_i} s e_i, \quad i = 1, \ldots, m \\
                  & Ax = b,
\end{array}
\]
with variables $x$ and $s \in \mathbf{R}$. The optimal value $\bar p^*$ determines the feasibility
of the equalities and generalized inequalities, exactly as in the case of ordinary
inequalities. When $\bar p^*$ is positive, any dual feasible point with positive objective
gives an alternative that proves the set of equalities and generalized inequalities is
infeasible (see page 270).


11.6.3  Examples 

A  small  SOCP 

We solve an SOCP
\[
\begin{array}{ll}
\mbox{minimize}   & f^T x \\
\mbox{subject to} & \| A_i x + b_i \|_2 \le c_i^T x + d_i, \quad i = 1, \ldots, m,
\end{array}
\]
with $x \in \mathbf{R}^{50}$, $m = 50$, and $A_i \in \mathbf{R}^{5 \times 50}$. The problem instance was randomly
generated, in such a way that the problem is strictly primal and dual feasible, and
has optimal value $p^* = 1$. We start with a point $x^{(0)}$ on the central path, with a
duality gap of 100.

Figure 11.15 Progress of barrier method for an SOCP, showing duality gap
versus cumulative number of Newton steps.

The barrier method is used to solve the problem, using the barrier function
\[
\phi(x) = -\sum_{i=1}^m \log \left( (c_i^T x + d_i)^2 - \| A_i x + b_i \|_2^2 \right).
\]
The centering problems are solved using Newton's method, with the same algorithm
parameters as in the examples of §11.3.2: backtracking parameters $\alpha = 0.01$,
$\beta = 0.5$, and a stopping criterion $\lambda(x)^2/2 \le 10^{-5}$.

Figure 11.15 shows the duality gap versus cumulative number of Newton steps.
The plot is very similar to those for linear and geometric programming, shown
in figures 11.4 and 11.6, respectively. We see an approximately constant number
of Newton steps required per centering step, and therefore approximately linear
convergence of the duality gap. For this example, too, the choice of $\mu$ has little
effect on the total number of Newton steps, provided $\mu$ is at least 10 or so. As in
the examples for linear and geometric programming, a reasonable choice of $\mu$ is in
the range 10 to 100, which results in a total number of Newton steps around 30 (see
figure 11.16).

A  small  SDP 

Our next example is an SDP
\[
\begin{array}{ll}
\mbox{minimize}   & c^T x \\
\mbox{subject to} & \sum_{i=1}^n x_i F_i + G \preceq 0
\end{array}
\tag{11.46}
\]
with variable $x \in \mathbf{R}^{100}$, and $F_i \in \mathbf{S}^{100}$, $G \in \mathbf{S}^{100}$. The problem instance was
generated randomly, in such a way that the problem is strictly primal and dual
feasible, with $p^* = 1$. The initial point is on the central path, with a duality gap
of 100.

Figure 11.16 Trade-off in the choice of the parameter $\mu$, for a small SOCP.
The vertical axis shows the total number of Newton steps required to reduce
the duality gap from 100 to $10^{-3}$, and the horizontal axis shows $\mu$.

We apply the barrier method with logarithmic barrier function
\[
\phi(x) = -\log \det \left( -\sum_{i=1}^n x_i F_i - G \right).
\]
The progress of the barrier method for three values of $\mu$ is shown in figure 11.17.
Note the similarity with the plots for linear, geometric, and second-order cone
programming, shown in figures 11.4, 11.6, and 11.15. As in the other examples,
the parameter $\mu$ has only a small effect on the efficiency, provided it is not too
small. The number of Newton steps required to reduce the duality gap by a factor
$10^5$, versus $\mu$, is shown in figure 11.18.


A  family  of  SDPs 

In this section we examine the performance of the barrier method as a function of
the problem dimensions. We consider a family of SDPs of the form
\[
\begin{array}{ll}
\mbox{minimize}   & \mathbf{1}^T x \\
\mbox{subject to} & A + \mathop{\bf diag}(x) \succeq 0,
\end{array}
\tag{11.47}
\]
with variable $x \in \mathbf{R}^n$, and parameter $A \in \mathbf{S}^n$. The matrices $A$ are generated as
follows. For $i \ge j$, the coefficients $A_{ij}$ are generated from independent $\mathcal{N}(0, 1)$
distributions. For $i < j$, we set $A_{ij} = A_{ji}$, so $A \in \mathbf{S}^n$. We then scale $A$ so that its
(spectral) norm is one.
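The recipe for generating $A$ can be sketched as follows. The function name and the seed are hypothetical, and NumPy's random generator stands in for whatever generator was actually used in the experiments.

```python
import numpy as np

def random_sdp_matrix(n, rng):
    """Generate the parameter A of (11.47) following the recipe in the text."""
    # Entries with i >= j are independent N(0, 1) samples.
    lower = np.tril(rng.standard_normal((n, n)))
    # For i < j, copy A_ji into A_ij so that A is symmetric.
    A = lower + np.tril(lower, -1).T
    # Scale so that the spectral norm (largest singular value) is one.
    return A / np.linalg.norm(A, 2)

rng = np.random.default_rng(0)
A = random_sdp_matrix(50, rng)
# Since the spectral norm of A is one, x = 2 * ones gives A + diag(x) > 0,
# so a strictly feasible point of (11.47) is immediately available.
min_eig = np.min(np.linalg.eigvalsh(A + np.diag(2.0 * np.ones(50))))
```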
Figure 11.17 Progress of barrier method for a small SDP, showing duality
gap versus cumulative number of Newton steps. Three plots are shown,
corresponding to three values of the parameter $\mu$: 2, 50, and 150.

Figure 11.18 Trade-off in the choice of the parameter $\mu$, for a small SDP.
The vertical axis shows the total number of Newton steps required to reduce
the duality gap from 100 to $10^{-3}$, and the horizontal axis shows $\mu$.

Figure 11.19 Progress of barrier method for three randomly generated SDPs
of the form (11.47), with different dimensions. The plot shows duality gap
versus cumulative number of Newton steps. The number of variables in each
problem is $n$.

The algorithm parameters are $\mu = 20$, and the same parameters for the centering
steps as in the examples above: backtracking parameters $\alpha = 0.01$, $\beta = 0.5$,
and stopping criterion $\lambda(x)^2/2 \le 10^{-5}$. The initial point is on the central path
with $t^{(0)} = 1$ (i.e., gap $n$). The algorithm is terminated when the initial duality
gap is reduced by a factor 8000, i.e., after completing three outer iterations.

Figure  11.19  shows  the  duality  gap  versus  iteration  number  for  three  problem 
instances,  with  dimensions  n  =  50,  n  =  500,  and  n  =  1000.  The  plots  look  very 
much  like  the  others,  and  very  much  like  the  ones  for  LPs. 

To  examine  the  effect  of  problem  size  on  the  number  of  Newton  steps  required, 
we  generate  100  problem  instances  for  each  of  20  values  of  n,  ranging  from  n  =  10 
to  n  =  1000.  We  solve  each  of  these  2000  problems  using  the  barrier  method,  noting 
the  number  of  Newton  steps  required.  The  results  are  summarized  in  figure  11.20, 
which  shows  the  mean  and  standard  deviation  in  the  number  of  Newton  steps,  for 
each  value  of  n.  The  plot  looks  very  much  like  the  one  for  LPs,  shown  in  figure  11.8. 
In  particular,  the  number  of  Newton  steps  required  grows  very  slowly,  from  around 
20  to  26  iterations,  as  the  problem  dimensions  increase  by  a  factor  of  100. 


11.6.4  Complexity  analysis  via  self-concordance 

In this section we extend the complexity analysis of the barrier method for problems
with ordinary inequalities (given in §11.5), to problems with generalized
inequalities. We have already seen that the number of outer iterations is given by
\[
\left\lceil \frac{\log(\bar\theta / (t^{(0)} \epsilon))}{\log \mu} \right\rceil,
\]
plus one initial centering step. It remains to bound the number of Newton steps
required in each centering step, which we will do using the complexity theory of
Newton's method for self-concordant functions. For simplicity, we will exclude the
cost of the initial centering.

Figure 11.20 Average number of Newton steps required to solve 100 randomly
generated SDPs (11.47) for each of 20 values of $n$, the problem size.
Error bars show standard deviation, around the average value, for each value
of $n$. The growth in the average number of Newton steps required, as the
problem dimensions range over a 100:1 ratio, is very small.

We make the same assumptions as in §11.5: The function $t f_0 + \phi$ is closed and
self-concordant for all $t \ge t^{(0)}$, and the sublevel sets of (11.38) are bounded.


Example 11.10 Second-order cone programming. The function
\[
-\psi(x) = -\log \left( x_{p+1}^2 - \sum_{i=1}^p x_i^2 \right)
\]
is self-concordant (see example 9.8), so the logarithmic barrier function (11.45)
satisfies the closedness and self-concordance assumption for the SOCP (11.44).


Example 11.11 Semidefinite programming. The self-concordance assumption holds
for general semidefinite programs, using $\log \det X$ as generalized logarithm for the
positive semidefinite cone. For example, for the standard form SDP
\[
\begin{array}{ll}
\mbox{minimize}   & \mathop{\bf tr}(CX) \\
\mbox{subject to} & \mathop{\bf tr}(A_i X) = b_i, \quad i = 1, \ldots, p \\
                  & X \succeq 0,
\end{array}
\]
with variable $X \in \mathbf{S}^n$, the function $t^{(0)} \mathop{\bf tr}(CX) - \log \det X$ is self-concordant (and
closed), for any $t^{(0)} \ge 0$.
We will see that, exactly as in the scalar case, we have
\[
\mu t f_0(x^*(t)) + \phi(x^*(t)) - \mu t f_0(x^*(\mu t)) - \phi(x^*(\mu t)) \le \bar\theta (\mu - 1 - \log \mu).
\tag{11.48}
\]
Therefore when the self-concordance and bounded sublevel set conditions hold, the
number of Newton steps per centering step is no more than
\[
\frac{\bar\theta (\mu - 1 - \log \mu)}{\gamma} + c,
\]
exactly as in the barrier method for problems with ordinary inequalities. Once
we establish the basic bound (11.48), the complexity analysis for problems with
generalized inequalities is identical to the analysis for problems with ordinary
inequalities, with one exception: $\bar\theta$ is the sum of the degrees of the cones, instead of
the number of inequalities.

Generalized  logarithm  for  dual  cone 

We will use conjugates to prove the bound (11.48). Let $\psi$ be a generalized logarithm
for the proper cone $K$, with degree $\theta$. The conjugate of the (convex) function $-\psi$
is
\[
(-\psi)^*(v) = \sup_u \left( v^T u + \psi(u) \right).
\]
This function is convex, and has domain $-\mathop{\bf int} K^* = \{ v \mid v \prec_{K^*} 0 \}$. Define $\bar\psi$ by
\[
\bar\psi(v) = -(-\psi)^*(-v) = \inf_u \left( v^T u - \psi(u) \right), \qquad
\mathop{\bf dom} \bar\psi = \mathop{\bf int} K^*.
\tag{11.49}
\]
The function $\bar\psi$ is concave, and in fact is a generalized logarithm for the dual cone
$K^*$, with the same parameter $\theta$ (see exercise 11.17). We call $\bar\psi$ the dual logarithm
associated with the generalized logarithm $\psi$.

From (11.49) we obtain the inequality
\[
\bar\psi(v) + \psi(u) \le u^T v,
\tag{11.50}
\]
which holds for any $u \succ_K 0$, $v \succ_{K^*} 0$, with equality holding if and only if $\nabla \psi(u) = v$
(or equivalently, $\nabla \bar\psi(v) = u$). (This inequality is just a variation on Young's
inequality, for concave functions.)


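Example 11.12 Second-order cone. The second-order cone has generalized logarithm
$\psi(x) = \log ( x_{p+1}^2 - \sum_{i=1}^p x_i^2 )$, with $\mathop{\bf dom} \psi = \{ x \in \mathbf{R}^{p+1} \mid x_{p+1} > ( \sum_{i=1}^p x_i^2 )^{1/2} \}$. The
associated dual logarithm is
\[
\bar\psi(y) = \log \left( y_{p+1}^2 - \sum_{i=1}^p y_i^2 \right) + 2 - \log 4,
\]
with $\mathop{\bf dom} \bar\psi = \{ y \in \mathbf{R}^{p+1} \mid y_{p+1} > ( \sum_{i=1}^p y_i^2 )^{1/2} \}$ (see exercise 3.36). Except for
a constant, it is the same as the original generalized logarithm for the second-order
cone.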
Example 11.13 Positive semidefinite cone. The dual logarithm associated with
$\psi(X) = \log \det X$, with $\mathop{\bf dom} \psi = \mathbf{S}^p_{++}$, is
\[
\bar\psi(Y) = \log \det Y + p,
\]
with domain $\mathop{\bf dom} \bar\psi = \mathbf{S}^p_{++}$ (see example 3.23). Again, it is the same generalized
logarithm, except for a constant.
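For the positive semidefinite cone, the inequality (11.50) reads $\log\det U + \log\det V + p \le \mathop{\bf tr}(UV)$, with equality when $V = U^{-1}$. The following sketch checks both statements on random positive definite matrices; the size, the seed, and the helper names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4                                        # hypothetical matrix size

def random_pd():
    B = rng.standard_normal((p, p))
    return B @ B.T + np.eye(p)               # positive definite by construction

def psi(U):                                  # generalized logarithm log det U
    return np.linalg.slogdet(U)[1]

def psi_bar(V):                              # dual logarithm log det V + p
    return np.linalg.slogdet(V)[1] + p

U, V = random_pd(), random_pd()

# Inequality (11.50): psi_bar(V) + psi(U) <= tr(U V), so the slack is >= 0.
slack = np.trace(U @ V) - (psi_bar(V) + psi(U))

# Equality case: V = grad psi(U) = U^{-1}, so the slack should vanish.
U_inv = np.linalg.inv(U)
slack_at_equality = np.trace(U @ U_inv) - (psi_bar(U_inv) + psi(U))
```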


Derivation  of  the  basic  bound 

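To simplify notation, we denote $x^*(t)$ as $x$, $x^*(\mu t)$ as $x^+$, $\lambda_i^*(t)$ as $\lambda_i$, and $\nu^*(t)$ as
$\nu$. From $t \lambda_i = \nabla \psi_i(-f_i(x))$ (in (11.42)) and property (11.43), we conclude that
\[
\psi_i(-f_i(x)) + \bar\psi_i(t \lambda_i) = -t \lambda_i^T f_i(x) = \theta_i,
\tag{11.51}
\]
i.e., the inequality (11.50) holds with equality for the pair $u = -f_i(x)$ and $v = t \lambda_i$.
The same inequality for the pair $u = -f_i(x^+)$, $v = \mu t \lambda_i$ gives
\[
\psi_i(-f_i(x^+)) + \bar\psi_i(\mu t \lambda_i) \le -\mu t \lambda_i^T f_i(x^+),
\]
which becomes, using logarithmic homogeneity of $\bar\psi_i$,
\[
\psi_i(-f_i(x^+)) + \bar\psi_i(t \lambda_i) + \theta_i \log \mu \le -\mu t \lambda_i^T f_i(x^+).
\]
Subtracting the equality (11.51) from this inequality, we get
\[
-\psi_i(-f_i(x)) + \psi_i(-f_i(x^+)) + \theta_i \log \mu \le -\theta_i - \mu t \lambda_i^T f_i(x^+),
\]
and summing over $i$ yields
\[
\phi(x) - \phi(x^+) + \bar\theta \log \mu \le -\bar\theta - \mu t \sum_{i=1}^m \lambda_i^T f_i(x^+).
\tag{11.52}
\]
We also have, from the definition of the dual function,
\[
\begin{array}{rcl}
f_0(x) - \bar\theta / t & = & g(\lambda, \nu) \\
& \le & f_0(x^+) + \sum_{i=1}^m \lambda_i^T f_i(x^+) + \nu^T (Ax^+ - b) \\
& = & f_0(x^+) + \sum_{i=1}^m \lambda_i^T f_i(x^+).
\end{array}
\]
Multiplying this inequality by $\mu t$ and adding to the inequality (11.52), we get
\[
\phi(x) - \phi(x^+) + \bar\theta \log \mu + \mu t f_0(x) - \mu \bar\theta \le \mu t f_0(x^+) - \bar\theta,
\]
which when re-arranged gives
\[
\mu t f_0(x) + \phi(x) - \mu t f_0(x^+) - \phi(x^+) \le \bar\theta (\mu - 1 - \log \mu),
\]
the desired inequality (11.48).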

11.7 Primal-dual interior-point methods

In  this  section  we  describe  a  basic  primal-dual  interior-point  method.  Primal- 
dual  interior-point  methods  are  very  similar  to  the  barrier  method,  with  some 
differences. 

•  There  is  only  one  loop  or  iteration,  i.e.,  there  is  no  distinction  between  inner 
and  outer  iterations  as  in  the  barrier  method.  At  each  iteration,  both  the 
primal  and  dual  variables  are  updated. 

• The search directions in a primal-dual interior-point method are obtained
from Newton's method, applied to modified KKT equations (i.e., the
optimality conditions for the logarithmic barrier centering problem). The
primal-dual search directions are similar to, but not quite the same as, the search
directions that arise in the barrier method.

•  In  a  primal-dual  interior-point  method,  the  primal  and  dual  iterates  are  not 
necessarily  feasible. 

Primal-dual interior-point methods are often more efficient than the barrier
method, especially when high accuracy is required, since they can exhibit better
than linear convergence. For several basic problem classes, such as linear, quadratic,
second-order cone, geometric, and semidefinite programming, customized
primal-dual methods outperform the barrier method. For general nonlinear convex
optimization problems, primal-dual interior-point methods are still a topic of active
research, but show great promise. Another advantage of primal-dual algorithms
over the barrier method is that they can work when the problem is feasible, but
not strictly feasible (although we will not pursue this).

In this section we present a basic primal-dual method for (11.1), without
convergence analysis. We refer the reader to the references for a more thorough treatment
of primal-dual methods and their convergence analysis.


11.7.1  Primal-dual  search  direction 


As in the barrier method, we start with the modified KKT conditions (11.15),
expressed as $r_t(x, \lambda, \nu) = 0$, where we define
\[
r_t(x, \lambda, \nu) = \left[ \begin{array}{c}
\nabla f_0(x) + Df(x)^T \lambda + A^T \nu \\
-\mathop{\bf diag}(\lambda) f(x) - (1/t) \mathbf{1} \\
Ax - b
\end{array} \right],
\tag{11.53}
\]
and $t > 0$. Here $f : \mathbf{R}^n \to \mathbf{R}^m$ and its derivative matrix $Df$ are given by
\[
f(x) = \left[ \begin{array}{c} f_1(x) \\ \vdots \\ f_m(x) \end{array} \right], \qquad
Df(x) = \left[ \begin{array}{c} \nabla f_1(x)^T \\ \vdots \\ \nabla f_m(x)^T \end{array} \right].
\]
If $x$, $\lambda$, $\nu$ satisfy $r_t(x, \lambda, \nu) = 0$ (and $f_i(x) < 0$), then $x = x^*(t)$, $\lambda = \lambda^*(t)$, and
$\nu = \nu^*(t)$. In particular, $x$ is primal feasible, and $\lambda$, $\nu$ are dual feasible, with
duality gap $m/t$. The first block component of $r_t$,
\[
r_{\rm dual} = \nabla f_0(x) + Df(x)^T \lambda + A^T \nu,
\]
is called the dual residual, and the last block component, $r_{\rm pri} = Ax - b$, is called
the primal residual. The middle block,
\[
r_{\rm cent} = -\mathop{\bf diag}(\lambda) f(x) - (1/t) \mathbf{1},
\]
is the centrality residual, i.e., the residual for the modified complementarity
condition.

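Now consider the Newton step for solving the nonlinear equations $r_t(x, \lambda, \nu) = 0$,
for fixed $t$ (without first eliminating $\lambda$, as in §11.3.4), at a point $(x, \lambda, \nu)$ that
satisfies $f(x) \prec 0$, $\lambda \succ 0$. We will denote the current point and Newton step as
\[
y = (x, \lambda, \nu), \qquad \Delta y = (\Delta x, \Delta \lambda, \Delta \nu),
\]
respectively. The Newton step is characterized by the linear equations
\[
r_t(y + \Delta y) \approx r_t(y) + Dr_t(y) \Delta y = 0,
\]
i.e., $\Delta y = -Dr_t(y)^{-1} r_t(y)$. In terms of $x$, $\lambda$, and $\nu$, we have
\[
\left[ \begin{array}{ccc}
\nabla^2 f_0(x) + \sum_{i=1}^m \lambda_i \nabla^2 f_i(x) & Df(x)^T & A^T \\
-\mathop{\bf diag}(\lambda) Df(x) & -\mathop{\bf diag}(f(x)) & 0 \\
A & 0 & 0
\end{array} \right]
\left[ \begin{array}{c} \Delta x \\ \Delta \lambda \\ \Delta \nu \end{array} \right]
= - \left[ \begin{array}{c} r_{\rm dual} \\ r_{\rm cent} \\ r_{\rm pri} \end{array} \right].
\tag{11.54}
\]
The primal-dual search direction $\Delta y_{\rm pd} = (\Delta x_{\rm pd}, \Delta \lambda_{\rm pd}, \Delta \nu_{\rm pd})$ is defined as the
solution of (11.54).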

The primal and dual search directions are coupled, both through the coefficient
matrix and the residuals. For example, the primal search direction $\Delta x_{\rm pd}$ depends
on the current value of the dual variables $\lambda$ and $\nu$, as well as $x$. We note also that
if $x$ satisfies $Ax = b$, i.e., the primal feasibility residual $r_{\rm pri}$ is zero, then we have
$A \Delta x_{\rm pd} = 0$, so $\Delta x_{\rm pd}$ defines a (primal) feasible direction: for any $s$, $x + s \Delta x_{\rm pd}$
will satisfy $A(x + s \Delta x_{\rm pd}) = b$.
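To make the linear system (11.54) concrete, the sketch below assembles and solves it for a special case chosen only for illustration: a quadratic objective $f_0(x) = (1/2)x^TPx + q^Tx$ and affine inequalities $f(x) = Cx - d$, so that $Df(x) = C$ and the Hessians of the $f_i$ vanish. The function name, the data names `P`, `q`, `C`, `d`, and the random instance are all assumptions of this sketch, and a dense solve is used purely for simplicity.

```python
import numpy as np

def pd_search_direction(P, q, C, d, A, b, x, lam, nu, t):
    """Solve (11.54) for f0(x) = 0.5 x'Px + q'x and f(x) = Cx - d."""
    n, m, p = x.size, C.shape[0], A.shape[0]
    f = C @ x - d                                   # requires f < 0
    r_dual = P @ x + q + C.T @ lam + A.T @ nu       # dual residual
    r_cent = -lam * f - 1.0 / t                     # centrality residual
    r_pri = A @ x - b                               # primal residual
    K = np.block([
        [P,                  C.T,          A.T],
        [-np.diag(lam) @ C,  -np.diag(f),  np.zeros((m, p))],
        [A,                  np.zeros((p, m)), np.zeros((p, p))],
    ])
    step = np.linalg.solve(K, -np.concatenate([r_dual, r_cent, r_pri]))
    return step[:n], step[n:n + m], step[n + m:]

rng = np.random.default_rng(4)
n, m, p = 5, 4, 2                                   # hypothetical sizes
P, q = np.eye(n), rng.standard_normal(n)
C, A = rng.standard_normal((m, n)), rng.standard_normal((p, n))
x = rng.standard_normal(n)
d = C @ x + 1.0               # gives f(x) = -1, strictly feasible
b = A @ x                     # gives r_pri = 0
lam, nu = np.ones(m), np.zeros(p)
dx, dlam, dnu = pd_search_direction(P, q, C, d, A, b, x, lam, nu, t=10.0)
```

Because the primal residual is zero at this test point, the computed `dx` lies in the nullspace of `A`, in line with the remark on feasible directions above.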

Comparison  with  barrier  method  search  directions 

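The primal-dual search directions are closely related to the search directions used
in the barrier method, but not quite the same. We start with the linear
equations (11.54) that define the primal-dual search directions. We eliminate the
variable $\Delta \lambda_{\rm pd}$, using
\[
\Delta \lambda_{\rm pd} = -\mathop{\bf diag}(f(x))^{-1} \mathop{\bf diag}(\lambda) Df(x) \Delta x_{\rm pd} + \mathop{\bf diag}(f(x))^{-1} r_{\rm cent},
\]
which comes from the second block of equations. Substituting this into the first
block of equations gives
\[
\left[ \begin{array}{cc} H_{\rm pd} & A^T \\ A & 0 \end{array} \right]
\left[ \begin{array}{c} \Delta x_{\rm pd} \\ \Delta \nu_{\rm pd} \end{array} \right]
= - \left[ \begin{array}{c} r_{\rm dual} + Df(x)^T \mathop{\bf diag}(f(x))^{-1} r_{\rm cent} \\ r_{\rm pri} \end{array} \right]
= - \left[ \begin{array}{c} \nabla f_0(x) + (1/t) \sum_{i=1}^m \frac{1}{-f_i(x)} \nabla f_i(x) + A^T \nu \\ r_{\rm pri} \end{array} \right],
\tag{11.55}
\]
where
\[
H_{\rm pd} = \nabla^2 f_0(x) + \sum_{i=1}^m \lambda_i \nabla^2 f_i(x)
+ \sum_{i=1}^m \frac{\lambda_i}{-f_i(x)} \nabla f_i(x) \nabla f_i(x)^T.
\tag{11.56}
\]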

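We can compare (11.55) to the equation (11.14), which defines the Newton step
for the centering problem in the barrier method with parameter $t$. This equation
can be written as
\[
\left[ \begin{array}{cc} H_{\rm bar} & A^T \\ A & 0 \end{array} \right]
\left[ \begin{array}{c} \Delta x_{\rm bar} \\ \nu_{\rm bar} \end{array} \right]
= - \left[ \begin{array}{c} t \nabla f_0(x) + \nabla \phi(x) \\ r_{\rm pri} \end{array} \right]
= - \left[ \begin{array}{c} t \nabla f_0(x) + \sum_{i=1}^m \frac{1}{-f_i(x)} \nabla f_i(x) \\ r_{\rm pri} \end{array} \right],
\tag{11.57}
\]
where
\[
H_{\rm bar} = t \nabla^2 f_0(x) + \sum_{i=1}^m \frac{1}{-f_i(x)} \nabla^2 f_i(x)
+ \sum_{i=1}^m \frac{1}{f_i(x)^2} \nabla f_i(x) \nabla f_i(x)^T.
\tag{11.58}
\]
(Here we give the general expression for the infeasible Newton step; if the current $x$
is feasible, i.e., $r_{\rm pri} = 0$, then $\Delta x_{\rm bar}$ coincides with the feasible Newton step $\Delta x_{\rm nt}$
defined in (11.14).)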

Our  first  observation  is  that  the  two  systems  of  equations  (11.55)  and  (11.57) 
are  very  similar.  The  coefficient  matrices  in  (11.55)  and  (11.57)  have  the  same 
structure;  indeed,  the  matrices  Hpd  and  Hdal  are  both  positive  linear  combinations 
of  the  matrices 

V2fo(x),  V2h(x), . . . ,  V2,fm(x),  V/1(a;)V/1(z)T, . . . ,  V/m(x)V/ro0r)T 


(11.57) 


This  means  that  the  same  method  can  be  used  to  compute  the  primal-dual  search 
directions  and  the  barrier  method  Newton  step. 

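We can say more about the relation between the primal-dual equations (11.55)
and the barrier method equations (11.57). Suppose we divide the first block of
equation (11.57) by $t$, and define the variable $\Delta\nu_{\mathrm{bar}} = (1/t)\nu_{\mathrm{bar}} - \nu$ (where $\nu$ is
arbitrary). Then we obtain

\[
\begin{bmatrix} (1/t)H_{\mathrm{bar}} & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x_{\mathrm{bar}} \\ \Delta\nu_{\mathrm{bar}} \end{bmatrix}
= -
\begin{bmatrix} \nabla f_0(x) + (1/t)\sum_{i=1}^m \frac{1}{-f_i(x)} \nabla f_i(x) + A^T\nu \\ r_{\mathrm{pri}} \end{bmatrix}.
\]

In this form, the righthand side is identical to the righthand side of the primal-dual
equations (evaluated at the same $x$, $\lambda$, and $\nu$). The coefficient matrices differ only
in the 1,1 block:

\[
\begin{aligned}
H_{\mathrm{pd}} &= \nabla^2 f_0(x) + \sum_{i=1}^m \lambda_i \nabla^2 f_i(x) + \sum_{i=1}^m \frac{\lambda_i}{-f_i(x)} \nabla f_i(x)\nabla f_i(x)^T, \\
(1/t)H_{\mathrm{bar}} &= \nabla^2 f_0(x) + \sum_{i=1}^m \frac{1}{-t f_i(x)} \nabla^2 f_i(x) + \sum_{i=1}^m \frac{1}{t f_i(x)^2} \nabla f_i(x)\nabla f_i(x)^T.
\end{aligned}
\]

When $x$ and $\lambda$ satisfy $-f_i(x)\lambda_i = 1/t$, the coefficient matrices, and therefore also
the search directions, coincide.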
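
As a quick numerical check of this last statement, the sketch below evaluates
both 1,1 blocks for a problem with one variable and one constraint. The
functions $f_0(x) = x^2$ and $f_1(x) = x^2 - 4$, the point $x = 1$, and the value
$t = 2$ are illustrative assumptions, not taken from the text.

```python
# Illustrative scalar problem (assumed, not from the text):
#   f0(x) = x^2,  f1(x) = x^2 - 4,  n = m = 1, no equality constraints.

def h_pd(x, lam):
    """1,1 block of the primal-dual system, following (11.56)."""
    f1, df1, d2f0, d2f1 = x * x - 4.0, 2.0 * x, 2.0, 2.0
    return d2f0 + lam * d2f1 + (lam / -f1) * df1 * df1


def h_bar_over_t(x, t):
    """(1/t) times the barrier Hessian, following (11.58)."""
    f1, df1, d2f0, d2f1 = x * x - 4.0, 2.0 * x, 2.0, 2.0
    return d2f0 + d2f1 / (-t * f1) + df1 * df1 / (t * f1 * f1)


x, t = 1.0, 2.0
lam_central = -1.0 / (t * (x * x - 4.0))   # satisfies -f1(x) * lam = 1/t
print(h_pd(x, lam_central), h_bar_over_t(x, t))   # the two values agree
print(h_pd(x, 1.0), h_bar_over_t(x, t))           # they differ for another lam
```

For any other positive value of the multiplier the two blocks differ, which is
why the primal-dual and barrier directions are related but not identical in
general.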
11.7.2 The surrogate duality gap

In the primal-dual interior-point method the iterates $x^{(k)}$, $\lambda^{(k)}$, and $\nu^{(k)}$ are not
necessarily feasible, except in the limit as the algorithm converges. This means
that we cannot easily evaluate a duality gap $\eta^{(k)}$ associated with step $k$ of the
algorithm, as we do in (the outer steps of) the barrier method. Instead we define
the surrogate duality gap, for any $x$ that satisfies $f(x) \prec 0$ and $\lambda \succeq 0$, as

\[
\hat\eta(x, \lambda) = -f(x)^T \lambda.
\tag{11.59}
\]

The surrogate gap $\hat\eta$ would be the duality gap, if $x$ were primal feasible and $\lambda$, $\nu$
were dual feasible, i.e., if $r_{\mathrm{pri}} = 0$ and $r_{\mathrm{dual}} = 0$. Note that the value of the
parameter $t$ that corresponds to the surrogate duality gap $\hat\eta$ is $m/\hat\eta$.
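
The definition (11.59) and the correspondence $t = m/\hat\eta$ can be sketched in a
few lines. The constraint values below are hypothetical; the function name
`surrogate_gap` is introduced here only for illustration.

```python
def surrogate_gap(f_vals, lam):
    """Surrogate duality gap (11.59): -f(x)^T lam, for f(x) < 0 and lam >= 0."""
    if any(fi >= 0 for fi in f_vals) or any(li < 0 for li in lam):
        raise ValueError("need f(x) < 0 and lam >= 0")
    return -sum(fi * li for fi, li in zip(f_vals, lam))


f_vals = [-0.5, -2.0, -0.1]                # hypothetical values f_i(x), all negative
t = 10.0
lam = [-1.0 / (t * fi) for fi in f_vals]   # central-path choice: -f_i * lam_i = 1/t
eta = surrogate_gap(f_vals, lam)
print(eta, len(f_vals) / eta)   # eta equals m/t, and m/eta recovers t
```

With the central-path choice of the multipliers, each term of the sum equals
$1/t$, so the gap is $m/t$ and the formula $m/\hat\eta$ returns the original $t$.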


11.7.3  Primal-dual  interior-point  method 

We  can  now  describe  the  basic  primal-dual  interior-point  algorithm. 


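Algorithm 11.2 Primal-dual interior-point method.

given $x$ that satisfies $f_1(x) < 0, \ldots, f_m(x) < 0$, $\lambda \succ 0$, $\mu > 1$, $\epsilon_{\mathrm{feas}} > 0$, $\epsilon > 0$.

repeat

1. Determine $t$. Set $t := \mu m/\hat\eta$.

2. Compute primal-dual search direction $\Delta y_{\mathrm{pd}}$.

3. Line search and update.

Determine step length $s > 0$ and set $y := y + s\Delta y_{\mathrm{pd}}$.

until $\|r_{\mathrm{pri}}\|_2 \leq \epsilon_{\mathrm{feas}}$, $\|r_{\mathrm{dual}}\|_2 \leq \epsilon_{\mathrm{feas}}$, and $\hat\eta \leq \epsilon$.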


In step 1, the parameter $t$ is set to a factor $\mu$ times $m/\hat\eta$, which is the value of $t$
associated with the current surrogate duality gap $\hat\eta$. If $x$, $\lambda$, and $\nu$ were central,
with parameter $t$ (and therefore with duality gap $m/t$), then in step 1 we would
increase $t$ by the factor $\mu$, which is exactly the update used in the barrier method.
Values of the parameter $\mu$ on the order of 10 appear to work well.

The primal-dual interior-point algorithm terminates when $x$ is primal feasible
and $\lambda$, $\nu$ are dual feasible (within the tolerance $\epsilon_{\mathrm{feas}}$) and the surrogate gap is
smaller than the tolerance $\epsilon$. Since the primal-dual interior-point method often has
faster than linear convergence, it is common to choose $\epsilon_{\mathrm{feas}}$ and $\epsilon$ small.

Line  search 

The line search in the primal-dual interior-point method is a standard backtracking
line search, based on the norm of the residual, and modified to ensure that $\lambda \succ 0$
and $f(x) \prec 0$. We denote the current iterate as $x$, $\lambda$, and $\nu$, and the next iterate
as $x^+$, $\lambda^+$, and $\nu^+$, i.e.,

\[
x^+ = x + s\Delta x_{\mathrm{pd}}, \qquad \lambda^+ = \lambda + s\Delta\lambda_{\mathrm{pd}}, \qquad \nu^+ = \nu + s\Delta\nu_{\mathrm{pd}}.
\]

The residual, evaluated at $y^+$, will be denoted $r^+$.

We first compute the largest positive step length, not exceeding one, that gives
$\lambda^+ \succeq 0$, i.e.,

\[
\begin{aligned}
s^{\max} &= \sup\{ s \in [0,1] \mid \lambda + s\Delta\lambda \succeq 0 \} \\
&= \min\{ 1, \ \min\{ -\lambda_i/\Delta\lambda_i \mid \Delta\lambda_i < 0 \} \}.
\end{aligned}
\]

We start the backtracking with $s = 0.99 s^{\max}$, and multiply $s$ by $\beta \in (0,1)$ until we
have $f(x^+) \prec 0$. We continue multiplying $s$ by $\beta$ until we have

\[
\|r_t(x^+, \lambda^+, \nu^+)\|_2 \leq (1 - \alpha s)\|r_t(x, \lambda, \nu)\|_2.
\]

Common choices for the backtracking parameters $\alpha$ and $\beta$ are the same as those for
Newton's method: $\alpha$ is typically chosen in the range 0.01 to 0.1, and $\beta$ is typically
chosen in the range 0.3 to 0.8.

One iteration of the primal-dual interior-point algorithm is the same as one step
of the infeasible Newton method, applied to solving $r_t(x, \lambda, \nu) = 0$, but modified to
ensure $\lambda \succ 0$ and $f(x) \prec 0$ (or, equivalently, with $\operatorname{dom} r_t$ restricted to $\lambda \succ 0$ and
$f(x) \prec 0$). The same arguments used in the proof of convergence of the infeasible
start Newton method show that the line search for the primal-dual method always
terminates in a finite number of steps.
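
To make the algorithm and its line search concrete, here is a minimal sketch
for an inequality-form LP, minimize $c^Tx$ subject to $Gx \preceq h$, under several
simplifying assumptions that are not part of the text: there are no equality
constraints (so $r_{\mathrm{pri}}$ and $\nu$ are absent), the constraint Hessians vanish,
and a small dense Gaussian elimination stands in for a real linear solver. The
names `G`, `h`, `solve`, `residuals`, and `primal_dual_lp` are hypothetical. The
two backtracking phases are merged into one loop, which gives the same step
because shrinking $s$ preserves $f(x^+) \prec 0$ once it holds.

```python
import math


def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            factor = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= factor * aug[k][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def residuals(c, G, h, x, lam, t):
    """Return f(x) = Gx - h, r_dual and r_cent (there is no r_pri here)."""
    m, n = len(G), len(c)
    f = [sum(G[i][j] * x[j] for j in range(n)) - h[i] for i in range(m)]
    r_dual = [c[j] + sum(G[i][j] * lam[i] for i in range(m)) for j in range(n)]
    r_cent = [-lam[i] * f[i] - 1.0 / t for i in range(m)]
    return f, r_dual, r_cent


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def primal_dual_lp(c, G, h, x, mu=10.0, alpha=0.01, beta=0.5,
                   eps=1e-8, eps_feas=1e-8, max_iter=100):
    m, n = len(G), len(c)
    f = [sum(G[i][j] * x[j] for j in range(n)) - h[i] for i in range(m)]
    lam = [-1.0 / fi for fi in f]            # requires f(x) < 0 at the start
    for _ in range(max_iter):
        eta = -sum(fi * li for fi, li in zip(f, lam))   # surrogate gap
        t = mu * m / eta                                 # step 1
        f, r_dual, r_cent = residuals(c, G, h, x, lam, t)
        if norm(r_dual) <= eps_feas and eta <= eps:
            break
        # Step 2: reduced system H_pd dx = -(r_dual + G^T diag(f)^-1 r_cent).
        H = [[sum(lam[i] / -f[i] * G[i][j] * G[i][k] for i in range(m))
              for k in range(n)] for j in range(n)]
        rhs = [-(r_dual[j] + sum(G[i][j] * r_cent[i] / f[i] for i in range(m)))
               for j in range(n)]
        dx = solve(H, rhs)
        Gdx = [sum(G[i][j] * dx[j] for j in range(n)) for i in range(m)]
        dlam = [(-lam[i] * Gdx[i] + r_cent[i]) / f[i] for i in range(m)]
        # Step 3: backtracking line search on the residual norm.
        s = 0.99 * min([1.0] + [-lam[i] / dlam[i]
                                for i in range(m) if dlam[i] < 0])
        r0 = norm(r_dual + r_cent)
        for _ in range(60):
            xn = [x[j] + s * dx[j] for j in range(n)]
            ln = [lam[i] + s * dlam[i] for i in range(m)]
            fn, rd, rc = residuals(c, G, h, xn, ln, t)
            if max(fn) < 0 and norm(rd + rc) <= (1 - alpha * s) * r0:
                break
            s *= beta
        x, lam, f = xn, ln, fn
    return x, lam


# Tiny illustrative LP: minimize -x1 - 2*x2
# subject to x1 + x2 <= 1.5, x1 <= 1, x2 <= 1, x1 >= 0, x2 >= 0.
c = [-1.0, -2.0]
G = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
h = [1.5, 1.0, 1.0, 0.0, 0.0]
x_opt, lam_opt = primal_dual_lp(c, G, h, x=[0.25, 0.25])
print(x_opt)   # close to the vertex (0.5, 1)
```

The default parameter values mirror those quoted in the text ($\mu = 10$,
$\alpha = 0.01$, $\beta = 0.5$); the iteration caps are safeguards added for this sketch.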


11.7.4  Examples 

We illustrate the performance of the primal-dual interior-point method for the
same problems considered in §11.3.2. The only difference is that instead of starting
with a point on the central path, as in §11.3.2, we start the primal-dual
interior-point method at a randomly generated $x^{(0)}$ that satisfies $f(x) \prec 0$, and take
$\lambda_i^{(0)} = -1/f_i(x^{(0)})$, so the initial value of the surrogate gap is $\hat\eta = 100$. The
parameter values we use for the primal-dual interior-point method are

\[
\mu = 10, \qquad \beta = 0.5, \qquad \epsilon = 10^{-8}, \qquad \alpha = 0.01.
\]

Small  LP  and  GP 

We  first  consider  the  small  LP  used  in  §11.3.2,  with  m  =  100  inequalities  and 
n  =  50  variables.  Figure  11.21  shows  the  progress  of  the  primal-dual  interior-point 
method. Two plots are shown: the surrogate gap $\hat\eta$, and the norm of the primal
and dual residuals,

\[
r_{\mathrm{feas}} = \left( \|r_{\mathrm{pri}}\|_2^2 + \|r_{\mathrm{dual}}\|_2^2 \right)^{1/2},
\]

versus  iteration  number.  (The  initial  point  is  primal  feasible,  so  the  plot  shows  the 
norm  of  the  dual  feasibility  residual.)  The  plots  show  that  the  residual  converges 
to  zero  rapidly,  and  becomes  zero  to  numerical  precision  in  24  iterations.  The 
surrogate  gap  also  converges  rapidly.  Compared  to  the  barrier  method,  the  primal- 
dual  interior-point  method  is  faster,  especially  when  high  accuracy  is  required. 

Figure  11.22  shows  the  progress  of  the  primal-dual  interior-point  method  on  the 
GP  considered  in  §11.3.2.  The  convergence  is  similar  to  the  LP  example. 
Figure 11.21 Progress of the primal-dual interior-point method for an LP,
showing surrogate duality gap $\hat\eta$ and the norm of the primal and dual residuals,
versus iteration number. The residual converges rapidly to zero within
24 iterations; the surrogate gap also converges to a very small number in
about 28 iterations. The primal-dual interior-point method converges faster
than the barrier method, especially if high accuracy is required.


Figure 11.22 Progress of primal-dual interior-point method for a GP, showing
surrogate duality gap $\hat\eta$ and the norm of the primal and dual residuals
versus iteration number.

Figure 11.23 Number of iterations required to solve randomly generated
standard LPs of different dimensions, with $n = 2m$. Error bars show standard
deviation, around the average value, for 100 instances of each dimension.
The growth in the number of iterations required, as the problem dimensions
range over a 100:1 ratio, is approximately logarithmic.


A  family  of  LPs 

Here we examine the performance of the primal-dual method as a function of
the problem dimensions, for the same family of standard form LPs considered
in §11.3.2. We use the primal-dual interior-point method to solve the same 2000
instances, which consist of 100 instances for each value of $m$. The primal-dual
algorithm is started at $x^{(0)} = \mathbf{1}$, $\lambda^{(0)} = \mathbf{1}$, $\nu^{(0)} = 0$, and terminated using tolerance
$\epsilon = 10^{-8}$. Figure 11.23 shows the average, and standard deviation, of the number
of iterations required versus $m$. The number of iterations ranges from 15 to 35,
and grows approximately as the logarithm of $m$. Comparing with the results for
the barrier method shown in figure 11.8, we see that the number of iterations in
the primal-dual method is only slightly higher, despite the fact that we start at
infeasible starting points, and solve the problem to a much higher accuracy.
The main effort in the barrier method is computing the Newton step for the
centering problem, which consists of solving sets of linear equations of the form

\[
\begin{bmatrix} H & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x_{\mathrm{nt}} \\ \nu_{\mathrm{nt}} \end{bmatrix}
= -
\begin{bmatrix} g \\ 0 \end{bmatrix},
\tag{11.60}
\]

where

\[
\begin{aligned}
H &= t\nabla^2 f_0(x) + \sum_{i=1}^m \frac{1}{f_i(x)^2} \nabla f_i(x)\nabla f_i(x)^T + \sum_{i=1}^m \frac{1}{-f_i(x)} \nabla^2 f_i(x), \\
g &= t\nabla f_0(x) + \sum_{i=1}^m \frac{1}{-f_i(x)} \nabla f_i(x).
\end{aligned}
\]

The Newton equations for the primal-dual method have exactly the same structure,
so our observations in this section apply to the primal-dual method as well.

The coefficient matrix of (11.60) has KKT structure, so all of the discussion
in §9.7 and §10.4 applies here. In particular, the equations can be solved by
elimination, and structure such as sparsity or diagonal plus low rank can be exploited.
Let us give some generic examples in which the special structure of the KKT
equations can be exploited to compute the Newton step more efficiently.

Sparse problems

If the original problem is sparse, which means that the objective and every
constraint function each depend on only a modest number of variables, then the
gradients and Hessian matrices of the objective and constraint functions are all sparse,
as is the coefficient matrix $A$. Provided $m$ is not too big, the matrix $H$ is then
likely to be sparse, so a sparse matrix method can be used to compute the Newton
step. The method will likely work well if there are a few relatively dense rows and
columns in the KKT matrix, which would occur, for example, if there were a few
equality constraints involving a large number of variables.

Separable objective and a few linear inequality constraints

Suppose the objective function is separable, and there are only a relatively small
number of linear equality and inequality constraints. Then $\nabla^2 f_0(x)$ is diagonal,
and the terms $\nabla^2 f_i(x)$ vanish, so the matrix $H$ is diagonal plus low rank. Since $H$
is easily inverted, we can solve the KKT equations efficiently. The same method
can be applied whenever $\nabla^2 f_0(x)$ is easily inverted, e.g., banded, sparse, or block
diagonal.


11.8.1  Standard  form  linear  programming 

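We first discuss the implementation of the barrier method for the standard form
LP

\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax = b, \quad x \succeq 0,
\end{array}
\]

with $A \in \mathbf{R}^{m \times n}$. The Newton equations for the centering problem

\[
\begin{array}{ll}
\text{minimize} & t c^T x - \sum_{i=1}^n \log x_i \\
\text{subject to} & Ax = b
\end{array}
\]

are given by

\[
\begin{bmatrix} \operatorname{diag}(x)^{-2} & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x_{\mathrm{nt}} \\ \nu_{\mathrm{nt}} \end{bmatrix}
=
\begin{bmatrix} -tc + \operatorname{diag}(x)^{-1}\mathbf{1} \\ 0 \end{bmatrix}.
\]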
These equations are usually solved by block elimination of $\Delta x_{\mathrm{nt}}$. From the first
equation,

\[
\begin{aligned}
\Delta x_{\mathrm{nt}} &= \operatorname{diag}(x)^2 \left( -tc + \operatorname{diag}(x)^{-1}\mathbf{1} - A^T\nu_{\mathrm{nt}} \right) \\
&= -t\operatorname{diag}(x)^2 c + x - \operatorname{diag}(x)^2 A^T \nu_{\mathrm{nt}}.
\end{aligned}
\]

Substituting in the second equation yields

\[
A\operatorname{diag}(x)^2 A^T \nu_{\mathrm{nt}} = -tA\operatorname{diag}(x)^2 c + b.
\]

The coefficient matrix is positive definite since by assumption $\operatorname{rank} A = m$.
Moreover, if $A$ is sparse, then usually $A\operatorname{diag}(x)^2 A^T$ is sparse, so a sparse Cholesky
factorization can be used.
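
The elimination can be traced on a toy instance. The sketch below assumes a
single equality constraint with $A = [1 \ \cdots \ 1]$, so that $A\operatorname{diag}(x)^2A^T$
reduces to the scalar $\sum_i x_i^2$ and no factorization is needed; the data and
the function name `lp_newton_step` are illustrative, not from the text.

```python
def lp_newton_step(c, x, b, t):
    """Newton step for: minimize t*c^T x - sum(log x_i) subject to 1^T x = b.

    Assumes A = [1, ..., 1] (one equality constraint) and that x already
    satisfies 1^T x = b, as the elimination formula in the text does.
    """
    x2 = [xi * xi for xi in x]                        # diagonal of diag(x)^2
    schur = sum(x2)                                   # A diag(x)^2 A^T (scalar)
    rhs = -t * sum(x2i * ci for x2i, ci in zip(x2, c)) + b
    nu = rhs / schur
    dx = [-t * x2i * ci + xi - x2i * nu for x2i, ci, xi in zip(x2, c, x)]
    return dx, nu


c, x, b, t = [1.0, 2.0, 3.0], [0.2, 0.3, 0.5], 1.0, 1.0
dx, nu = lp_newton_step(c, x, b, t)
print(sum(dx))   # A dx, zero up to rounding, so x + s*dx still satisfies Ax = b
```

For general $A$ the scalar division would be replaced by a (sparse) Cholesky
factorization of $A\operatorname{diag}(x)^2A^T$, as described above.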


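11.8.2 $\ell_1$-norm approximation


Consider the $\ell_1$-norm approximation problem

\[
\text{minimize} \quad \|Ax - b\|_1
\]

with $A \in \mathbf{R}^{m \times n}$. We will discuss the implementation assuming $m$ and $n$ are large,
and $A$ is structured, e.g., sparse, and compare it with the cost of the corresponding
least-squares problem

\[
\text{minimize} \quad \|Ax - b\|_2^2.
\]

We start by expressing the $\ell_1$-norm approximation problem as an LP by
introducing auxiliary variables $y \in \mathbf{R}^m$:

\[
\begin{array}{ll}
\text{minimize} & \mathbf{1}^T y \\
\text{subject to} & \begin{bmatrix} A & -I \\ -A & -I \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix} \preceq \begin{bmatrix} b \\ -b \end{bmatrix}.
\end{array}
\]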
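The Newton equation for the centering problem is

\[
\begin{bmatrix} A^T & -A^T \\ -I & -I \end{bmatrix}
\begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix}
\begin{bmatrix} A & -I \\ -A & -I \end{bmatrix}
\begin{bmatrix} \Delta x_{\mathrm{nt}} \\ \Delta y_{\mathrm{nt}} \end{bmatrix}
= -
\begin{bmatrix} A^T g_1 \\ g_2 \end{bmatrix},
\]

where

\[
D_1 = \operatorname{diag}(b - Ax + y)^{-2}, \qquad D_2 = \operatorname{diag}(-b + Ax + y)^{-2}
\]

and

\[
\begin{aligned}
g_1 &= \operatorname{diag}(b - Ax + y)^{-1}\mathbf{1} - \operatorname{diag}(-b + Ax + y)^{-1}\mathbf{1}, \\
g_2 &= t\mathbf{1} - \operatorname{diag}(b - Ax + y)^{-1}\mathbf{1} - \operatorname{diag}(-b + Ax + y)^{-1}\mathbf{1}.
\end{aligned}
\]

If we multiply out the lefthand side, this can be simplified as

\[
\begin{bmatrix} A^T(D_1 + D_2)A & -A^T(D_1 - D_2) \\ -(D_1 - D_2)A & D_1 + D_2 \end{bmatrix}
\begin{bmatrix} \Delta x_{\mathrm{nt}} \\ \Delta y_{\mathrm{nt}} \end{bmatrix}
= -
\begin{bmatrix} A^T g_1 \\ g_2 \end{bmatrix}.
\]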
Applying block elimination to $\Delta y_{\mathrm{nt}}$, we can reduce this to

\[
A^T D A \Delta x_{\mathrm{nt}} = -A^T g,
\tag{11.61}
\]

where

\[
D = 4D_1D_2(D_1 + D_2)^{-1} = 2\left( \operatorname{diag}(y)^2 + \operatorname{diag}(b - Ax)^2 \right)^{-1}
\]

and

\[
g = g_1 + (D_1 - D_2)(D_1 + D_2)^{-1} g_2.
\]

After solving for $\Delta x_{\mathrm{nt}}$, we obtain $\Delta y_{\mathrm{nt}}$ from

\[
\Delta y_{\mathrm{nt}} = (D_1 + D_2)^{-1}\left( -g_2 + (D_1 - D_2)A\Delta x_{\mathrm{nt}} \right).
\]

It is interesting to note that (11.61) are the normal equations of a weighted
least-squares problem

\[
\text{minimize} \quad \|D^{1/2}(A\Delta x + D^{-1}g)\|_2.
\]

In other words, the cost of solving the $\ell_1$-norm approximation problem is the cost
of solving a relatively small number of weighted least-squares problems with the
same matrix $A$, and weights that change at each iteration. If $A$ has structure
that allows us to solve the least-squares problem fast (for example, by exploiting
sparsity), then we can solve (11.61) fast.
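
The reduction can be checked numerically on a toy instance. The sketch below
assumes a single unknown ($n = 1$, so $A$ is a column `a` and $A^TDA$ is a scalar);
the data and the function name `l1_newton_step` are illustrative. It solves the
reduced equation (11.61), recovers $\Delta y_{\mathrm{nt}}$, and then evaluates the first
block row of the unreduced system to confirm that the two formulations agree.

```python
def l1_newton_step(a, b, x, y, t):
    """Newton step for the l1 centering problem with one variable x.

    Requires strict feasibility: y_i > |a_i * x - b_i| for every i.
    """
    m = len(a)
    u = [b[i] - a[i] * x for i in range(m)]       # b - Ax
    p = [u[i] + y[i] for i in range(m)]           # b - Ax + y  > 0
    q = [-u[i] + y[i] for i in range(m)]          # -b + Ax + y > 0
    d1 = [1.0 / pi ** 2 for pi in p]
    d2 = [1.0 / qi ** 2 for qi in q]
    g1 = [1.0 / p[i] - 1.0 / q[i] for i in range(m)]
    g2 = [t - 1.0 / p[i] - 1.0 / q[i] for i in range(m)]
    D = [4.0 * d1[i] * d2[i] / (d1[i] + d2[i]) for i in range(m)]
    g = [g1[i] + (d1[i] - d2[i]) / (d1[i] + d2[i]) * g2[i] for i in range(m)]
    # Reduced system (11.61): with n = 1, A^T D A and A^T g are scalars.
    dx = (-sum(a[i] * g[i] for i in range(m))
          / sum(a[i] ** 2 * D[i] for i in range(m)))
    dy = [(-g2[i] + (d1[i] - d2[i]) * a[i] * dx) / (d1[i] + d2[i])
          for i in range(m)]
    # Residual of the first block row of the unreduced Newton system.
    top = sum(a[i] * ((d1[i] + d2[i]) * a[i] * dx
                      - (d1[i] - d2[i]) * dy[i] + g1[i]) for i in range(m))
    return dx, dy, D, top


a, b = [1.0, 2.0, -1.0], [1.0, 0.5, 0.3]
x, y, t = 0.2, [2.0, 1.0, 1.5], 5.0
dx, dy, D, top = l1_newton_step(a, b, x, y, t)
print(top)   # zero up to rounding
print([2.0 / (y[i] ** 2 + (b[i] - a[i] * x) ** 2) for i in range(3)], D)
```

The second printed line compares the two expressions given for $D$; they match
entry by entry.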


11.8.3  Semidefinite  programming  in  inequality  form 

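We consider the SDP

\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & \sum_{i=1}^n x_i F_i + G \preceq 0,
\end{array}
\]

with variable $x \in \mathbf{R}^n$, and parameters $F_1, \ldots, F_n, G \in \mathbf{S}^p$. The associated centering
problem, using the log-determinant barrier function, is

\[
\text{minimize} \quad t c^T x - \log\det\left( -\sum_{i=1}^n x_i F_i - G \right).
\]

The Newton step $\Delta x_{\mathrm{nt}}$ is found from $H\Delta x_{\mathrm{nt}} = -g$, where the Hessian and gradient
are given by

\[
\begin{aligned}
H_{ij} &= \operatorname{tr}(S^{-1}F_i S^{-1}F_j), \quad i, j = 1, \ldots, n, \\
g_i &= t c_i + \operatorname{tr}(S^{-1}F_i), \quad i = 1, \ldots, n,
\end{aligned}
\]

where $S = -\sum_{i=1}^n x_i F_i - G$. One standard approach is to form $H$ (and $g$), and
then solve the Newton equation via Cholesky factorization.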

We first consider the unstructured case, i.e., we assume all matrices are dense.
We will also just keep track of the order in the flop count, with respect to the
problem dimensions $n$ and $p$. We first form $S$, which costs order $np^2$ flops. We
then compute the matrices $S^{-1}F_i$, for each $i$, via Cholesky factorization of $S$, and
then back substitution with the columns of $F_i$ (or forming $S^{-1}$ and multiplying
by $F_i$). This cost is order $p^3$ for each $i$, so the total cost is order $np^3$. Finally,
we form $H_{ij}$ as the inner product of the matrices $S^{-1}F_i$ and $S^{-1}F_j$, which costs
order $p^2$ flops. Since we do this for $n(n+1)/2$ such pairs, the cost is order $n^2p^2$.
Solving for the Newton direction costs order $n^3$. The dominating order is thus
$\max\{np^3, n^2p^2, n^3\}$.

It is not possible, in general, to exploit sparsity in the matrices $F_i$ and $G$, since
$H$ is often dense, even when $F_i$ and $G$ are sparse. One exception is when $F_i$ and $G$
have a common block diagonal structure, in which case all the operations described
above can be carried out block by block.

It is often possible to exploit (common) sparsity in $F_i$ and $G$ to form the (dense)
Hessian $H$ more efficiently. If we can find an ordering that results in $S$ having
a reasonably sparse Cholesky factor, then we can compute the matrices $S^{-1}F_i$
efficiently, and form $H_{ij}$ far more efficiently.

One interesting example that arises frequently is an SDP with matrix inequality

\[
\operatorname{diag}(x) \preceq B.
\]

This corresponds to $F_i = E_{ii}$, where $E_{ii}$ is the matrix with $i,i$ entry one and all
others zero. In this case, the matrix $H$ can be found very efficiently:

\[
H_{ij} = (S^{-1})_{ij}^2,
\]

where $S = B - \operatorname{diag}(x)$. The cost of forming $H$ is thus the cost of forming $S^{-1}$,
which is at most (i.e., when no other structure is exploited) order $n^3$.
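
The shortcut $H_{ij} = (S^{-1})_{ij}^2$ can be compared against the general trace
formula on a small instance. The $2 \times 2$ matrix `B` and the point `x` below are
hypothetical, chosen so that $S = B - \operatorname{diag}(x)$ is positive definite; the helper
functions are written out only to keep the sketch self-contained.

```python
def inv2(S):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def trace(X):
    return sum(X[i][i] for i in range(len(X)))


def unit(i, n):
    """E_ii: the n x n matrix with a one in entry (i, i) and zeros elsewhere."""
    return [[1.0 if r == i and c == i else 0.0 for c in range(n)]
            for r in range(n)]


B = [[3.0, 1.0], [1.0, 2.0]]      # hypothetical symmetric B
x = [0.5, 0.25]                   # S = B - diag(x) is positive definite here
S = [[B[i][j] - (x[i] if i == j else 0.0) for j in range(2)] for i in range(2)]
Sinv = inv2(S)

# General formula H_ij = tr(S^-1 F_i S^-1 F_j) with F_i = E_ii ...
H_general = [[trace(matmul(matmul(Sinv, unit(i, 2)), matmul(Sinv, unit(j, 2))))
              for j in range(2)] for i in range(2)]
# ... versus the shortcut H_ij = (S^-1)_ij squared.
H_fast = [[Sinv[i][j] ** 2 for j in range(2)] for i in range(2)]
print(H_general)
print(H_fast)   # entrywise equal up to rounding
```

The general route performs several matrix products per entry, while the
shortcut needs only one inverse, which is the source of the order $n^3$ cost
quoted above.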


11.8.4  Network  rate  optimization 


We consider a variation on the optimal network flow problem described in §10.4.3
(page 550), which is sometimes called the network rate optimization problem. The
network is described as a directed graph with $L$ arcs or links. Goods, or packets
of information, travel on the network, passing through the links. The network
supports $n$ flows, with (nonnegative) rates $x_1, \ldots, x_n$, which are the optimization
variables. Each flow moves along a fixed, or pre-determined, path (or route) in the
network, from a source node to a destination node. Each link can support multiple
flows passing through it. The total traffic on a link is the sum of the flow rates of
the flows that travel over the link. Each link has a positive capacity, which is the
maximum total traffic it can handle.

We can describe these link capacity limits using the flow-link incidence matrix
$A \in \mathbf{R}^{L \times n}$, defined as

\[
A_{ij} = \begin{cases} 1 & \text{flow $j$ passes through link $i$} \\ 0 & \text{otherwise.} \end{cases}
\]

The total traffic on link $i$ is then given by $(Ax)_i$, so the link capacity constraints
can be expressed as $Ax \preceq c$, where $c_i$ is the capacity of link $i$. Usually each path
passes through only a small fraction of the total number of links, so the matrix $A$
is sparse.

In the network rate problem the paths are fixed (and encoded in the matrix $A$,
which is a problem parameter); the variables are the flow rates $x_i$. The objective
is to choose the flow rates to maximize a separable utility function $U$, given by

\[
U(x) = U_1(x_1) + \cdots + U_n(x_n).
\]

We assume that each $U_i$ (and hence, $U$) is concave and nondecreasing. We can
think of $U_i(x_i)$ as the income derived from supporting the $i$th flow at rate $x_i$; $U(x)$
is then the total income associated with the flows. The network rate optimization
problem is then

\[
\begin{array}{ll}
\text{maximize} & U(x) \\
\text{subject to} & Ax \preceq c, \quad x \succeq 0,
\end{array}
\]

which is a convex optimization problem.

Let us apply the barrier method to solve this problem. At each step we must
minimize a function of the form

\[
-tU(x) - \sum_{i=1}^L \log(c - Ax)_i - \sum_{j=1}^n \log x_j,
\]

using Newton's method. The Newton step $\Delta x_{\mathrm{nt}}$ is found by solving the linear
equations

\[
(D_0 + A^T D_1 A + D_2)\Delta x_{\mathrm{nt}} = -g,
\]

where

\[
\begin{aligned}
D_0 &= -t\operatorname{diag}(U_1''(x_1), \ldots, U_n''(x_n)), \\
D_1 &= \operatorname{diag}(1/(c - Ax)_1^2, \ldots, 1/(c - Ax)_L^2), \\
D_2 &= \operatorname{diag}(1/x_1^2, \ldots, 1/x_n^2)
\end{aligned}
\]

are diagonal matrices, and $g \in \mathbf{R}^n$. We can describe the sparsity structure of this
$n \times n$ coefficient matrix precisely:

\[
(D_0 + A^T D_1 A + D_2)_{ij} \neq 0
\]

if and only if flow $i$ and flow $j$ share a link. If the paths are relatively short, and
each link has relatively few paths passing through it, then this matrix is sparse, so
a sparse Cholesky factorization can be used. We can also solve the Newton system
efficiently when some, but not too many, of the rows and columns are relatively
dense. This occurs when a few of the flows intersect with a large number of the
other flows, which might occur if a few flows are relatively long.

We can also use the matrix inversion lemma to compute the Newton step by
solving a system with $L \times L$ coefficient matrix, with form

\[
(D_1^{-1} + A(D_0 + D_2)^{-1}A^T)y = -A(D_0 + D_2)^{-1}g,
\]

and then computing

\[
\Delta x_{\mathrm{nt}} = -(D_0 + D_2)^{-1}(g + A^T y).
\]

Here too we can precisely describe the sparsity pattern:

\[
(D_1^{-1} + A(D_0 + D_2)^{-1}A^T)_{ij} \neq 0
\]

if and only if there is a path that passes through link $i$ and link $j$. If most paths
are short, this matrix is sparse. This matrix will be sparse, with a few dense rows
and columns, if there are a few bottlenecks, i.e., a few links over which many flows
travel.
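
The two routes to the Newton step can be compared on a small network. The
sketch below assumes logarithmic utilities $U_j(x_j) = \log x_j$ (one concave,
nondecreasing choice among many), a hypothetical network with $L = 2$ links and
$n = 3$ flows, and a dense Gaussian elimination in place of a sparse Cholesky
factorization. The function names are introduced here for illustration only.

```python
def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            factor = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= factor * aug[k][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def network_newton(A, c, x, t):
    """Newton step computed directly (n x n) and via the inversion lemma (L x L).

    Assumes U_j(x_j) = log x_j, so U_j' = 1/x_j and U_j'' = -1/x_j^2.
    """
    L, n = len(A), len(x)
    slack = [c[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(L)]
    d0 = [t / xj ** 2 for xj in x]               # -t * U_j''(x_j)
    d1 = [1.0 / si ** 2 for si in slack]
    d2 = [1.0 / xj ** 2 for xj in x]
    g = [-t / x[j] + sum(A[i][j] / slack[i] for i in range(L)) - 1.0 / x[j]
         for j in range(n)]
    # Direct: (D0 + A^T D1 A + D2) dx = -g.
    M = [[sum(A[i][j] * d1[i] * A[i][k] for i in range(L))
          + (d0[j] + d2[j] if j == k else 0.0) for k in range(n)]
         for j in range(n)]
    dx_direct = solve(M, [-gj for gj in g])
    # Lemma: (D1^-1 + A (D0 + D2)^-1 A^T) y = -A (D0 + D2)^-1 g.
    w = [1.0 / (d0[j] + d2[j]) for j in range(n)]
    K = [[sum(A[i][j] * w[j] * A[k][j] for j in range(n))
          + (1.0 / d1[i] if i == k else 0.0) for k in range(L)]
         for i in range(L)]
    y = solve(K, [-sum(A[i][j] * w[j] * g[j] for j in range(n))
                  for i in range(L)])
    dx_lemma = [-w[j] * (g[j] + sum(A[i][j] * y[i] for i in range(L)))
                for j in range(n)]
    return dx_direct, dx_lemma


# Flow 1 uses link 1, flow 2 uses both links, flow 3 uses link 2.
A = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
c = [1.0, 2.0]
x = [0.3, 0.4, 0.5]          # strictly feasible: Ax < c and x > 0
dx_direct, dx_lemma = network_newton(A, c, x, t=2.0)
print(dx_direct)
print(dx_lemma)   # agrees with the direct solve up to rounding
```

Which system is cheaper depends on the network: the $n \times n$ form is attractive
when flows rarely share links, the $L \times L$ form when $L$ is much smaller than $n$
or when few paths connect any given pair of links.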
[BTN01], Renegar [Ren01], and Peng, Roos, and Terlaky [PRT02].
Exercises


The  barrier  method 

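11.1 Barrier method example. Consider the simple problem

\[
\begin{array}{ll}
\text{minimize} & x^2 + 1 \\
\text{subject to} & 2 \leq x \leq 4,
\end{array}
\]

which has feasible set $[2,4]$, and optimal point $x^\star = 2$. Plot $f_0$, and $tf_0 + \phi$, for several
values of $t > 0$, versus $x$. Label $x^\star(t)$.

11.2 What happens if the barrier method is applied to the LP

\[
\begin{array}{ll}
\text{minimize} & x_2 \\
\text{subject to} & x_1 \leq x_2, \quad 0 \leq x_2,
\end{array}
\]

with variable $x \in \mathbf{R}^2$?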

11.3 Boundedness of centering problem. Suppose the sublevel sets of (11.1),

\[
\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& Ax = b,
\end{array}
\]

are bounded. Show that the sublevel sets of the associated centering problem,

\[
\begin{array}{ll}
\text{minimize} & tf_0(x) + \phi(x) \\
\text{subject to} & Ax = b,
\end{array}
\]

are bounded.

11.4 Adding a norm bound to ensure strong convexity of the centering problem. Suppose we
add the constraint $x^Tx \leq R^2$ to the problem (11.1):

\[
\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m \\
& Ax = b \\
& x^Tx \leq R^2.
\end{array}
\]

Let $\phi$ denote the logarithmic barrier function for this modified problem. Find $a > 0$ for
which $\nabla^2(tf_0(x) + \phi(x)) \succeq aI$ holds, for all feasible $x$.

11.5 Barrier method for second-order cone programming. Consider the SOCP (without equality
constraints, for simplicity)

\[
\begin{array}{ll}
\text{minimize} & f^Tx \\
\text{subject to} & \|A_ix + b_i\|_2 \leq c_i^Tx + d_i, \quad i = 1, \ldots, m.
\end{array}
\tag{11.63}
\]

The constraint functions in this problem are not differentiable (since the Euclidean norm
$\|u\|_2$ is not differentiable at $u = 0$) so the (standard) barrier method cannot be applied.
In §11.6, we saw that this SOCP can be solved by an extension of the barrier method
that handles generalized inequalities. (See example 11.8, page 599, and page 601.) In this
exercise, we show how the standard barrier method (with scalar constraint functions) can
be used to solve the SOCP.

We first reformulate the SOCP as

\[
\begin{array}{ll}
\text{minimize} & f^Tx \\
\text{subject to} & \|A_ix + b_i\|_2^2/(c_i^Tx + d_i) \leq c_i^Tx + d_i, \quad i = 1, \ldots, m \\
& c_i^Tx + d_i \geq 0, \quad i = 1, \ldots, m.
\end{array}
\tag{11.64}
\]

The constraint function

\[
f_i(x) = \frac{\|A_ix + b_i\|_2^2}{c_i^Tx + d_i} - c_i^Tx - d_i
\]

is the composition of a quadratic-over-linear function with an affine function, and is twice
differentiable (and convex), provided we define its domain as $\operatorname{dom} f_i = \{x \mid c_i^Tx + d_i > 0\}$.
Note that the two problems (11.63) and (11.64) are not exactly equivalent. If $c_i^Tx^\star + d_i = 0$
for some $i$, where $x^\star$ is the optimal solution of the SOCP (11.63), then the reformulated
problem (11.64) is not solvable; $x^\star$ is not in its domain. Nevertheless we will see that
the barrier method, applied to (11.64), produces arbitrarily accurate suboptimal solutions
of (11.64), and hence also for (11.63).


(a) Form the log barrier $\phi$ for the problem (11.64). Compare it to the log barrier that
arises when the SOCP (11.63) is solved using the barrier method for generalized
inequalities (in §11.6).

(b) Show that if $tf^Tx + \phi(x)$ is minimized, the minimizer $x^\star(t)$ is $2m/t$-suboptimal for
the problem (11.63). It follows that the standard barrier method, applied to the
reformulated problem (11.64), solves the SOCP (11.63), in the sense of producing
arbitrarily accurate suboptimal solutions. This is the case even though the optimal
point $x^\star$ need not be in the domain of the reformulated problem (11.64).


11.6 General barriers. The log barrier is based on the approximation $-(1/t)\log(-u)$ of the
indicator function $I_-(u)$ (see §11.2.1, page 563). We can also construct barriers from
other approximations, which in turn yield generalizations of the central path and barrier
method. Let $h : \mathbf{R} \to \mathbf{R}$ be a twice differentiable, closed, increasing convex function,
with $\operatorname{dom} h = -\mathbf{R}_{++}$. (This implies $h(u) \to \infty$ as $u \to 0$.) One such function is
$h(u) = -\log(-u)$; another example is $h(u) = -1/u$ (for $u < 0$).

Now consider the optimization problem (without equality constraints, for simplicity)

\[
\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m,
\end{array}
\]

where $f_i$ are twice differentiable. We define the $h$-barrier for this problem as

\[
\phi_h(x) = \sum_{i=1}^m h(f_i(x)),
\]

with domain $\{x \mid f_i(x) < 0, \ i = 1, \ldots, m\}$. When $h(u) = -\log(-u)$, this is the usual
logarithmic barrier; when $h(u) = -1/u$, $\phi_h$ is called the inverse barrier. We define the
$h$-central path as

\[
x^\star(t) = \operatorname{argmin} \ tf_0(x) + \phi_h(x),
\]

where $t > 0$ is a parameter. (We assume that for each $t$, the minimizer exists and is
unique.)

(a) Explain why $tf_0(x) + \phi_h(x)$ is convex in $x$, for each $t > 0$.

(b) Show how to construct a dual feasible $\lambda$ from $x^\star(t)$. Find the associated duality gap.

(c) For what functions $h$ does the duality gap found in part (b) depend only on $t$ and
$m$ (and no other problem data)?

11.7 Tangent to central path. This problem concerns $dx^\star(t)/dt$, which gives the tangent to the
central path at the point $x^\star(t)$. For simplicity, we consider a problem without equality
constraints; the results readily generalize to problems with equality constraints.

(a) Find an explicit expression for $dx^\star(t)/dt$. Hint. Differentiate the centrality
equations (11.7) with respect to $t$.

(b) Show that $f_0(x^\star(t))$ decreases as $t$ increases. Thus, the objective value in the barrier
method decreases, as the parameter $t$ is increased. (We already know that the duality
gap, which is $m/t$, decreases as $t$ increases.)

11.8 Predictor-corrector method for centering problems. In the standard barrier method, $x^\star(\mu t)$
is computed using Newton's method, starting from the initial point $x^\star(t)$. One alternative
that has been proposed is to make an approximation or prediction $\hat x$ of $x^\star(\mu t)$, and then
start the Newton method for computing $x^\star(\mu t)$ from $\hat x$. The idea is that this should
reduce the number of Newton steps, since $\hat x$ is (presumably) a better initial point than
$x^\star(t)$. This method of centering is called a predictor-corrector method, since it first makes
a prediction of what $x^\star(\mu t)$ is, then corrects the prediction using Newton's method.

The most widely used predictor is the first-order predictor, based on the tangent to the
central path, explored in exercise 11.7. This predictor is given by

\[
\hat x = x^\star(t) + \frac{dx^\star(t)}{dt}(\mu t - t).
\]

Derive an expression for the first-order predictor $\hat x$. Compare it to the Newton update
obtained, i.e., $x^\star(t) + \Delta x_{\mathrm{nt}}$, where $\Delta x_{\mathrm{nt}}$ is the Newton step for $\mu tf_0(x) + \phi(x)$, at $x^\star(t)$.
What can you say when the objective $f_0$ is linear? (For simplicity, you can consider a
problem without equality constraints.)

11.9 Dual feasible points near the central path. Consider the problem

\[
\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m,
\end{array}
\]

with variable $x \in \mathbf{R}^n$. We assume the functions $f_i$ are convex and twice differentiable. (We
assume for simplicity there are no equality constraints.) Recall (from §11.2.2, page 565)
that $\lambda_i = -1/(tf_i(x^\star(t)))$, $i = 1, \ldots, m$, is dual feasible, and in fact, $x^\star(t)$ minimizes
$L(x, \lambda)$. This allows us to evaluate the dual function for $\lambda$, which turns out to be $g(\lambda) =
f_0(x^\star(t)) - m/t$. In particular, we conclude that $x^\star(t)$ is $m/t$-suboptimal.

In this problem we consider what happens when a point $x$ is close to $x^\star(t)$, but not quite
centered. (This would occur if the centering steps were terminated early, or not carried
out to full accuracy.) In this case, of course, we cannot claim that $\lambda_i = -1/(tf_i(x))$,
$i = 1, \ldots, m$, is dual feasible, or that $x$ is $m/t$-suboptimal. However, it turns out that
a slightly more complicated formula does yield a dual feasible point, provided $x$ is close
enough to centered.

Let $\Delta x_{\mathrm{nt}}$ be the Newton step at $x$ of the centering problem

\[
\text{minimize} \quad tf_0(x) - \sum_{i=1}^m \log(-f_i(x)).
\]

A formula that often gives a dual feasible point when $\Delta x_{\mathrm{nt}}$ is small (i.e., for $x$ nearly
centered) is

\[
\lambda_i = \frac{1}{-tf_i(x)}\left( 1 + \frac{\nabla f_i(x)^T\Delta x_{\mathrm{nt}}}{-f_i(x)} \right), \quad i = 1, \ldots, m.
\]

In this case, the vector $x$ does not minimize $L(x, \lambda)$, so there is no general formula for the
dual function value $g(\lambda)$ associated with $\lambda$. (If we have an analytical expression for the
dual objective, however, we can simply evaluate $g(\lambda)$.)

Verify that for a QCQP

\[
\begin{array}{ll}
\text{minimize} & (1/2)x^TP_0x + q_0^Tx + r_0 \\
\text{subject to} & (1/2)x^TP_ix + q_i^Tx + r_i \leq 0, \quad i = 1, \ldots, m,
\end{array}
\]

the formula for $\lambda$ yields a dual feasible point (i.e., $\lambda \succeq 0$ and $L(x, \lambda)$ is bounded below)
when $\Delta x_{\mathrm{nt}}$ is sufficiently small.
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Hint.  Define 


xo  =  x  +  AXnt, 


Xi  =  x  — 


1 

t\ifi(x ) 


AXnt  , 


i  =  1, . ,  m. 


Show  that 

m 

V/o(x0)  +  ^  A  iVfi(xi)  =  0. 

i=  1 

Now  use  fi(z)  >  fi(xi)  +  V  fi(xi)T  (z  —  Xi),  i  =  0,  to  derive  a  lower  bound  on 

L(*,A). 
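
The identity in the hint can be checked numerically before attempting the proof. The following is a minimal sketch, assuming NumPy and a small randomly generated QCQP; the data, the dimensions, the value of $t$, and all variable names are hypothetical and not taken from the text. The point $x = 0$ is strictly feasible by construction but is not centered.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t = 4, 3, 10.0

def random_pd(n):
    # Random symmetric positive definite matrix (hypothetical test data).
    B = rng.standard_normal((n, n))
    return B @ B.T + np.eye(n)

# Quadratic functions f_i(x) = (1/2) x^T P_i x + q_i^T x + r_i, i = 0, ..., m.
P = [random_pd(n) for _ in range(m + 1)]
q = [rng.standard_normal(n) for _ in range(m + 1)]
r = [0.0] + [-1.0] * m            # r_i < 0 makes x = 0 strictly feasible

def f(i, x):
    return 0.5 * x @ P[i] @ x + q[i] @ x + r[i]

def grad(i, x):
    return P[i] @ x + q[i]

x = np.zeros(n)                   # strictly feasible, not centered

# Gradient and Hessian of t*f_0(x) - sum_i log(-f_i(x)) at x.
g = t * grad(0, x)
H = t * P[0].copy()
for i in range(1, m + 1):
    fi, gi = f(i, x), grad(i, x)
    g += gi / (-fi)
    H += P[i] / (-fi) + np.outer(gi, gi) / fi**2
dx = np.linalg.solve(H, -g)       # Newton step of the centering problem

# lambda_i from the formula in the exercise.
lam = np.array([
    (1.0 / (-t * f(i, x))) * (1.0 + grad(i, x) @ dx / (-f(i, x)))
    for i in range(1, m + 1)
])

# Points x_0 and x_i from the hint, and the residual of the identity.
x0 = x + dx
res = grad(0, x0)
for i in range(1, m + 1):
    xi = x - dx / (t * lam[i - 1] * f(i, x))
    res = res + lam[i - 1] * grad(i, xi)
residual_norm = np.linalg.norm(res)
```

On this instance `residual_norm` should be zero up to rounding. The sketch checks only the stationarity identity; whether `lam` is nonnegative depends on how small the Newton step is, which is the part the exercise asks you to argue.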

11.10 Another parametrization of the central path. We consider the problem (11.1), with central
path $x^\star(t)$ for $t > 0$, defined as the solution of

\[
\begin{array}{ll}
\mbox{minimize} & t f_0(x) - \sum_{i=1}^m \log(-f_i(x)) \\
\mbox{subject to} & Ax = b.
\end{array}
\]

In this problem we explore another parametrization of the central path.
For $u > p^\star$, let $z^\star(u)$ denote the solution of

\[
\begin{array}{ll}
\mbox{minimize} & -\log(u - f_0(x)) - \sum_{i=1}^m \log(-f_i(x)) \\
\mbox{subject to} & Ax = b.
\end{array}
\]

Show that the curve defined by $z^\star(u)$, for $u > p^\star$, is the central path. (In other words,
for each $u > p^\star$, there is a $t > 0$ for which $x^\star(t) = z^\star(u)$, and conversely, for each $t > 0$,
there is a $u > p^\star$ for which $z^\star(u) = x^\star(t)$.)

11.11 Method of analytic centers. In this problem we consider a variation on the barrier method,
based on the parametrization of the central path described in exercise 11.10. For simplicity,
we consider a problem with no equality constraints,

\[
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \leq 0, \quad i = 1, \ldots, m.
\end{array}
\]

The method of analytic centers starts with any strictly feasible initial point $x^{(0)}$, and any
$u^{(0)} > f_0(x^{(0)})$. We then set

\[
u^{(1)} = \theta u^{(0)} + (1 - \theta) f_0(x^{(0)}),
\]

where $\theta \in (0, 1)$ is an algorithm parameter (usually chosen small), and then compute the
next iterate as

\[
x^{(1)} = z^\star(u^{(1)})
\]

(using Newton's method, starting from $x^{(0)}$). Here $z^\star(s)$ denotes the minimizer of

\[
-\log(s - f_0(x)) - \sum_{i=1}^m \log(-f_i(x)),
\]

which we assume exists and is unique. This process is then repeated.
The point $z^\star(s)$ is the analytic center of the inequalities

\[
f_0(x) \leq s, \quad f_1(x) \leq 0, \quad \ldots, \quad f_m(x) \leq 0,
\]

hence the algorithm name.

Show that the method of centers works, i.e., $x^{(k)}$ converges to an optimal point. Find a
stopping criterion that guarantees that $x$ is $\epsilon$-suboptimal, where $\epsilon > 0$.

Hint. The points $x^{(k)}$ are on the central path; see exercise 11.10. Use this to show that

\[
u^+ - p^\star \leq \frac{m + \theta}{m + 1} (u - p^\star),
\]

where $u$ and $u^+$ are the values of $u$ on consecutive iterations.
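
The iteration is easy to try on a toy problem. The sketch below assumes NumPy and a hypothetical linear program (minimize $x_1 + x_2$ over the unit box, so $p^\star = 0$ and $m = 4$); the helper `analytic_center`, its tolerances, and the line-search constants are illustrative choices, not part of the exercise. It records $u - p^\star$ at each iteration so the contraction factor in the hint can be compared with what is observed, and it also records $m(u - f_0(x))$ as one candidate bound on the suboptimality of a centered point.

```python
import numpy as np

def analytic_center(G, h, x, tol=1e-10, max_iter=100):
    """Damped Newton method for minimizing -sum(log(h - G x))."""
    phi = lambda z: -np.sum(np.log(h - G @ z))
    for _ in range(max_iter):
        s = h - G @ x                          # slacks, all positive
        grad = G.T @ (1.0 / s)
        hess = G.T @ (G / s[:, None] ** 2)
        dx = np.linalg.solve(hess, -grad)
        if -(grad @ dx) / 2 <= tol:            # Newton decrement test
            break
        step = 1.0
        for _ in range(60):                    # backtracking line search
            z = x + step * dx
            if np.all(h - G @ z > 0) and \
                    phi(z) <= phi(x) + 0.25 * step * (grad @ dx):
                break
            step *= 0.5
        x = x + step * dx
    return x

# Hypothetical test problem: minimize x1 + x2 subject to 0 <= x <= 1.
c = np.array([1.0, 1.0])
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.array([1.0, 1.0, 0.0, 0.0])
m, theta, p_star = len(b), 0.1, 0.0

x = np.array([0.5, 0.5])           # strictly feasible x^(0)
u = c @ x + 1.0                    # any u^(0) > f_0(x^(0))
gaps, bounds = [], []
for k in range(40):
    u = theta * u + (1 - theta) * (c @ x)
    # z*(u): analytic center of c^T x <= u, A x <= b.
    x = analytic_center(np.vstack([c, A]), np.concatenate([[u], b]), x)
    gaps.append(u - p_star)
    bounds.append(m * (u - c @ x))  # candidate bound on f_0(x) - p*
```

The list `gaps` can be compared against the factor $(m + \theta)/(m + 1)$ from the hint. In a real implementation $p^\star$ is unknown, so only a computable quantity such as the entries of `bounds` could serve as a stopping test.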
11.12 Barrier method for convex-concave games. We consider a convex-concave game with
inequality constraints,

\[
\begin{array}{ll}
\mbox{minimize}_w \; \mbox{maximize}_z & f_0(w, z) \\
\mbox{subject to} & f_i(w) \leq 0, \quad i = 1, \ldots, m \\
& \tilde f_i(z) \leq 0, \quad i = 1, \ldots, \tilde m.
\end{array}
\]

Here $w \in \mathbf{R}^n$ is the variable associated with minimizing the objective, and $z \in \mathbf{R}^{\tilde n}$ is
the variable associated with maximizing the objective. The constraint functions $f_i$ and $\tilde f_i$
are convex and differentiable, and the objective function $f_0$ is differentiable and convex-concave,
i.e., convex in $w$, for each $z$, and concave in $z$, for each $w$. We assume for
simplicity that $\mathbf{dom}\, f_0 = \mathbf{R}^n \times \mathbf{R}^{\tilde n}$.

A solution or saddle-point for the game is a pair $w^\star$, $z^\star$, for which

\[
f_0(w^\star, z) \leq f_0(w^\star, z^\star) \leq f_0(w, z^\star)
\]

holds for every feasible $w$ and $z$. (For background on convex-concave games and functions,
see §5.4.3, §10.3.4 and exercises 3.14, 5.24, 5.25, 10.10, and 10.13.) In this exercise we
show how to solve this game using an extension of the barrier method, and the infeasible
start Newton method (see §10.3).

(a) Let $t > 0$. Explain why the function

\[
t f_0(w, z) - \sum_{i=1}^m \log(-f_i(w)) + \sum_{i=1}^{\tilde m} \log(-\tilde f_i(z))
\]

is convex-concave in $(w, z)$. We will assume that it has a unique saddle-point,
$(w^\star(t), z^\star(t))$, which can be found using the infeasible start Newton method.

(b) As in the barrier method for solving a convex optimization problem, we can derive
a simple bound on the suboptimality of $(w^\star(t), z^\star(t))$, which depends only on the
problem dimensions, and decreases to zero as $t$ increases. Let $W$ and $Z$ denote the
feasible sets for $w$ and $z$,

\[
W = \{ w \mid f_i(w) \leq 0, \; i = 1, \ldots, m \}, \qquad
Z = \{ z \mid \tilde f_i(z) \leq 0, \; i = 1, \ldots, \tilde m \}.
\]

Show that

\[
f_0(w^\star(t), z^\star(t)) \leq \inf_{w \in W} f_0(w, z^\star(t)) + \frac{m}{t},
\]

\[
f_0(w^\star(t), z^\star(t)) \geq \sup_{z \in Z} f_0(w^\star(t), z) - \frac{\tilde m}{t},
\]

and therefore

\[
\sup_{z \in Z} f_0(w^\star(t), z) - \inf_{w \in W} f_0(w, z^\star(t)) \leq \frac{m + \tilde m}{t}.
\]

Self-concordance  and  complexity  analysis 

11.13 Self-concordance and negative entropy.

(a) Show that the negative entropy function $x \log x$ (on $\mathbf{R}_{++}$) is not self-concordant.

(b) Show that for any $t > 0$, $t x \log x - \log x$ is self-concordant (on $\mathbf{R}_{++}$).

11.14 Self-concordance and the centering problem. Let $\phi$ be the logarithmic barrier function of
problem (11.1). Suppose that the sublevel sets of (11.1) are bounded, and that $t f_0 + \phi$ is
closed and self-concordant. Show that $t \nabla^2 f_0(x) + \nabla^2 \phi(x) \succ 0$, for all $x \in \mathbf{dom}\, \phi$. Hint.
See exercises 9.17 and 11.3.

Barrier method for generalized inequalities

11.15 Generalized logarithm is $K$-increasing. Let $\psi$ be a generalized logarithm for the proper
cone $K$. Suppose $y \succ_K 0$.

(a) Show that $\nabla \psi(y) \succeq_{K^*} 0$, i.e., that $\psi$ is $K$-nondecreasing. Hint. If $\nabla \psi(y) \not\succeq_{K^*} 0$,
then there is some $w \succ_K 0$ for which $w^T \nabla \psi(y) \leq 0$. Use the inequality
$\psi(sw) \leq \psi(y) + \nabla \psi(y)^T (sw - y)$, with $s > 0$.

(b) Now show that $\nabla \psi(y) \succ_{K^*} 0$, i.e., that $\psi$ is $K$-increasing. Hint. Show that
$\nabla^2 \psi(y) \prec 0$, $\nabla \psi(y) \succeq_{K^*} 0$ imply $\nabla \psi(y) \succ_{K^*} 0$.

11.16 [NN94, page 41] Properties of a generalized logarithm. Let $\psi$ be a generalized logarithm
for the proper cone $K$, with degree $\theta$. Prove that the following properties hold at any
$y \succ_K 0$.

(a) $\nabla \psi(sy) = \nabla \psi(y)/s$ for all $s > 0$.

(b) $\nabla \psi(y) = -\nabla^2 \psi(y) y$.

(c) $y^T \nabla^2 \psi(y) y = -\theta$.

(d) $\nabla \psi(y)^T \nabla^2 \psi(y)^{-1} \nabla \psi(y) = -\theta$.

11.17 Dual generalized logarithm. Let $\psi$ be a generalized logarithm for the proper cone $K$, with
degree $\theta$. Show that the dual generalized logarithm $\bar\psi$, defined in (11.49), satisfies

\[
\bar\psi(sv) = \bar\psi(v) + \theta \log s,
\]

for $v \succ_{K^*} 0$, $s > 0$.

11.18 Is the function

\[
\psi(y) = \log \left( y_{n+1} - \frac{\sum_{i=1}^n y_i^2}{y_{n+1}} \right),
\]

with $\mathbf{dom}\, \psi = \{ y \in \mathbf{R}^{n+1} \mid y_{n+1} > (\sum_{i=1}^n y_i^2)^{1/2} \}$, a generalized logarithm for the
second-order cone in $\mathbf{R}^{n+1}$?

Implementation

11.19 Yet another method for computing the Newton step. Show that the Newton step for the
barrier method, which is given by the solution of the linear equations (11.14), can be
found by solving a larger set of linear equations with coefficient matrix

\[
\left[ \begin{array}{ccc}
t \nabla^2 f_0(x) + \sum_i \frac{1}{-f_i(x)} \nabla^2 f_i(x) & Df(x)^T & A^T \\
Df(x) & -\mathbf{diag}(f(x))^2 & 0 \\
A & 0 & 0
\end{array} \right]
\]

where $f(x) = (f_1(x), \ldots, f_m(x))$.

For what types of problem structure might solving this larger system be interesting?

11.20 Network rate optimization via the dual problem. In this problem we examine a dual method
for solving the network rate optimization problem of §11.8.4. To simplify the presentation
we assume that the utility functions $U_i$ are strictly concave, with $\mathbf{dom}\, U_i = \mathbf{R}_{++}$, and
that they satisfy $U_i'(x_i) \to \infty$ as $x_i \to 0$ and $U_i'(x_i) \to 0$ as $x_i \to \infty$.

(a) Express the dual problem of (11.62) in terms of the conjugate utility functions
$V_i = (-U_i)^*$, defined as

\[
V_i(\lambda) = \sup_{x > 0} (\lambda x + U_i(x)).
\]

Show that $\mathbf{dom}\, V_i = -\mathbf{R}_{++}$, and that for each $\lambda < 0$ there is a unique $x$ with
$U_i'(x) = -\lambda$.

(b) Describe a barrier method for the dual problem. Compare the complexity per iteration
with the complexity of the method in §11.8.4. Distinguish the same two cases
as in §11.8.4 ($A^T A$ is sparse and $A A^T$ is sparse).

Numerical experiments

11.21 Log-Chebyshev approximation with bounds. We consider an approximation problem: find
$x \in \mathbf{R}^n$, that satisfies the variable bounds $l \preceq x \preceq u$, and yields $Ax \approx b$, where $b \in \mathbf{R}^m$.
You can assume that $l \prec u$, and $b \succ 0$ (for reasons we explain below). We let $a_i^T$ denote
the $i$th row of the matrix $A$.

We judge the approximation $Ax \approx b$ by the maximum fractional deviation, which is

\[
\max_{i=1,\ldots,m} \max\{ (a_i^T x)/b_i, \; b_i/(a_i^T x) \}
= \max_{i=1,\ldots,m} \frac{\max\{ a_i^T x, b_i \}}{\min\{ a_i^T x, b_i \}},
\]

when $Ax \succ 0$; we define the maximum fractional deviation as $\infty$ if $Ax \not\succ 0$.

The problem of minimizing the maximum fractional deviation is called the fractional
Chebyshev approximation problem, or the logarithmic Chebyshev approximation problem,
since it is equivalent to minimizing the objective

\[
\max_{i=1,\ldots,m} | \log a_i^T x - \log b_i |.
\]

(See also exercise 6.3, part (c).)
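
The equivalence of the two objectives rests on the identity $\max\{y/b, b/y\} = \exp|\log y - \log b|$ for positive $y$ and $b$. A minimal numerical sketch of this relation, assuming NumPy and hypothetical positive random data (none of it from the exercise):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical positive data, so that A x > 0 holds automatically.
A = rng.uniform(0.5, 2.0, size=(6, 3))
b = rng.uniform(0.5, 2.0, size=6)
x = rng.uniform(0.5, 2.0, size=3)

y = A @ x
# Maximum fractional deviation between A x and b.
frac_dev = np.max(np.maximum(y / b, b / y))
# Log-Chebyshev objective.
log_dev = np.max(np.abs(np.log(y) - np.log(b)))
```

Since the logarithm is increasing, `np.log(frac_dev)` and `log_dev` agree, so minimizing one objective minimizes the other.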

(a) Formulate the fractional Chebyshev approximation problem (with variable bounds)
as a convex optimization problem with twice differentiable objective and constraint
functions.

(b) Implement a barrier method that solves the fractional Chebyshev approximation
problem. You can assume an initial point $x^{(0)}$, satisfying $l \prec x^{(0)} \prec u$, $Ax^{(0)} \succ 0$, is
known.


11.22 Maximum volume rectangle inside a polyhedron. Consider the problem described in exercise
8.16, i.e., finding the maximum volume rectangle $\mathcal{R} = \{ x \mid l \preceq x \preceq u \}$ that lies in
a polyhedron described by a set of linear inequalities, $\mathcal{P} = \{ x \mid Ax \preceq b \}$. Implement a
barrier method for solving this problem. You can assume that $b \succ 0$, which means that
for small $l \prec 0$ and $u \succ 0$, the rectangle $\mathcal{R}$ lies inside $\mathcal{P}$.

Test your implementation on several simple examples. Find the maximum volume rectangle
that lies in the polyhedron defined by


A  = 


-1 

-4 

1 

4 

0 


$b = \mathbf{1}$.


Plot  this  polyhedron,  and  the  maximum  volume  rectangle  that  lies  inside  it. 

11.23 SDP bounds and heuristics for the two-way partitioning problem. In this exercise we
consider the two-way partitioning problem (5.7), described on page 219, and also in exercise
5.39:

\[
\begin{array}{ll}
\mbox{minimize} & x^T W x \\
\mbox{subject to} & x_i^2 = 1, \quad i = 1, \ldots, n,
\end{array}
\qquad (11.65)
\]

with variable $x \in \mathbf{R}^n$. We assume, without loss of generality, that $W \in \mathbf{S}^n$ satisfies
$W_{ii} = 0$. We denote the optimal value of the partitioning problem as $p^\star$, and $x^\star$ will
denote an optimal partition. (Note that $-x^\star$ is also an optimal partition.)

The Lagrange dual of the two-way partitioning problem (11.65) is given by the SDP

\[
\begin{array}{ll}
\mbox{maximize} & -\mathbf{1}^T \nu \\
\mbox{subject to} & W + \mathbf{diag}(\nu) \succeq 0,
\end{array}
\qquad (11.66)
\]

with variable $\nu \in \mathbf{R}^n$. The dual of this SDP is

\[
\begin{array}{ll}
\mbox{minimize} & \mathbf{tr}(WX) \\
\mbox{subject to} & X \succeq 0 \\
& X_{ii} = 1, \quad i = 1, \ldots, n,
\end{array}
\qquad (11.67)
\]

with variable $X \in \mathbf{S}^n$. (This SDP can be interpreted as a relaxation of the two-way
partitioning problem (11.65); see exercise 5.39.) The optimal values of these two SDPs
are equal, and give a lower bound, which we denote $d^\star$, on the optimal value $p^\star$. Let $\nu^\star$
and $X^\star$ denote optimal points for the two SDPs.

(a) Implement a barrier method that solves the SDP (11.66) and its dual (11.67), given
the weight matrix $W$. Explain how you obtain nearly optimal $\nu$ and $X$, give formulas
for any Hessians and gradients that your method requires, and explain how
you compute the Newton step. Test your implementation on some small problem
instances, comparing the bound you find with the optimal value (which can be found
by checking the objective value of all $2^n$ partitions). Try your implementation on a
randomly chosen problem instance large enough that you cannot find the optimal
partition by exhaustive search (e.g., $n = 100$).

(b) A heuristic for partitioning. In exercise 5.39, you found that if $X^\star$ has rank one,
then it must have the form $X^\star = x^\star (x^\star)^T$, where $x^\star$ is optimal for the two-way
partitioning problem. This suggests the following simple heuristic for finding a good
partition (if not the best): solve the SDPs above, to find $X^\star$ (and the bound $d^\star$).
Let $v$ denote an eigenvector of $X^\star$ associated with its largest eigenvalue, and let
$\hat x = \mathbf{sign}(v)$. The vector $\hat x$ is our guess for a good partition.

Try this heuristic on some small problem instances, and the large problem instance
you used in part (a). Compare the objective value of your heuristic partition, $\hat x^T W \hat x$,
with the lower bound $d^\star$.

(c) A randomized method. Another heuristic technique for finding a good partition,
given the solution $X^\star$ of the SDP (11.67), is based on randomization. The method
is simple: we generate independent samples $x^{(1)}, \ldots, x^{(K)}$ from a normal distribution
on $\mathbf{R}^n$, with zero mean and covariance $X^\star$. For each sample we consider the heuristic
approximate solution $\hat x^{(k)} = \mathbf{sign}(x^{(k)})$. We then take the best among these, i.e.,
the one with lowest cost. Try out this procedure on some small problem instances,
and the large problem instance you considered in part (a).

(d) A greedy heuristic refinement. Suppose you are given a partition $x$, i.e., $x_i \in \{-1, 1\}$,
$i = 1, \ldots, n$. How does the objective value change if we move element $i$ from one
set to the other, i.e., change $x_i$ to $-x_i$? Now consider the following simple greedy
algorithm: given a starting partition $x$, move the element that gives the largest
reduction in the objective. Repeat this procedure until no reduction in objective
can be obtained by moving an element from one set to the other.

Try this heuristic on some problem instances, including the large one, starting from
various initial partitions, including $x = \mathbf{1}$, the heuristic approximate solution found
in part (b), and the randomly generated approximate solutions found in part (c).
How much does this greedy refinement improve your approximate solutions from
parts (b) and (c)?
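
The rounding step of part (c) and the greedy loop of part (d) can be sketched as follows. This is an illustration under stated assumptions, not a solution to part (a): it assumes NumPy, uses a random correlation matrix as a hypothetical stand-in for $X^\star$ (no SDP is solved here), and the function names `randomized_partition` and `greedy_refine` are invented for the example. The greedy step uses the fact that, for symmetric $W$ with zero diagonal, changing $x_i$ to $-x_i$ changes $x^T W x$ by $-4 x_i (Wx)_i$.

```python
import numpy as np

def randomized_partition(W, X, num_samples, rng):
    """Part (c): best sign-rounded sample drawn from N(0, X)."""
    n = W.shape[0]
    samples = rng.multivariate_normal(np.zeros(n), X, size=num_samples)
    candidates = np.where(samples >= 0, 1.0, -1.0)   # sign, with sign(0) = +1
    values = np.einsum("ki,ij,kj->k", candidates, W, candidates)
    k = np.argmin(values)
    return candidates[k], values[k]

def greedy_refine(W, x, tol=1e-12):
    """Part (d): flip single elements while x^T W x decreases."""
    x = x.copy()
    while True:
        # Change in objective from flipping each x_i (W symmetric, W_ii = 0).
        delta = -4.0 * x * (W @ x)
        i = np.argmin(delta)
        if delta[i] >= -tol:
            return x
        x[i] = -x[i]

rng = np.random.default_rng(0)
n = 12

# Hypothetical weight matrix: symmetric with zero diagonal.
B = rng.standard_normal((n, n))
W = (B + B.T) / 2
np.fill_diagonal(W, 0.0)

# Stand-in for X*: a random correlation matrix (PSD, unit diagonal).
C = rng.standard_normal((n, n))
S = C @ C.T
d = np.sqrt(np.diag(S))
X = S / np.outer(d, d)
X = (X + X.T) / 2

x_rand, val_rand = randomized_partition(W, X, 100, rng)
x_greedy = greedy_refine(W, x_rand)
val_greedy = x_greedy @ W @ x_greedy
```

With an actual $X^\star$ from part (a) in place of the stand-in `X`, the values `val_rand` and `val_greedy` could be compared with the lower bound $d^\star$.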

11.24 Barrier and primal-dual interior-point methods for quadratic programming. Implement
a barrier method, and a primal-dual method, for solving the QP (without equality constraints,
for simplicity)

\[
\begin{array}{ll}
\mbox{minimize} & (1/2) x^T P x + q^T x \\
\mbox{subject to} & Ax \preceq b,
\end{array}
\]

with $A \in \mathbf{R}^{m \times n}$. You can assume a strictly feasible initial point is given. Test your codes
on several examples. For the barrier method, plot the duality gap versus Newton steps.
For the primal-dual interior-point method, plot the surrogate duality gap and the norm
of the dual residual versus iteration number.

Mathematical background


In  this  appendix  we  give  a  brief  review  of  some  basic  concepts  from  analysis  and 
linear  algebra.  The  treatment  is  by  no  means  complete,  and  is  meant  mostly  to  set 
out  our  notation. 


A.1 Norms

A.1.1 Inner product, Euclidean norm, and angle

The standard inner product on $\mathbf{R}^n$, the set of real $n$-vectors, is given by

\[
\langle x, y \rangle = x^T y = \sum_{i=1}^n x_i y_i,
\]

for $x, y \in \mathbf{R}^n$. In this book we use the notation $x^T y$, instead of $\langle x, y \rangle$. The
Euclidean norm, or $\ell_2$-norm, of a vector $x \in \mathbf{R}^n$ is defined as

\[
\|x\|_2 = (x^T x)^{1/2} = (x_1^2 + \cdots + x_n^2)^{1/2}. \qquad (A.1)
\]

The Cauchy-Schwarz inequality states that $|x^T y| \leq \|x\|_2 \|y\|_2$ for any $x, y \in \mathbf{R}^n$.
The (unsigned) angle between nonzero vectors $x, y \in \mathbf{R}^n$ is defined as

\[
\angle(x, y) = \cos^{-1} \left( \frac{x^T y}{\|x\|_2 \|y\|_2} \right),
\]

where we take $\cos^{-1}(u) \in [0, \pi]$. We say $x$ and $y$ are orthogonal if $x^T y = 0$.
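
As an illustration of these definitions, the following sketch computes the angle between two vectors with NumPy. The function name `angle` and the test vectors are assumptions made for the example; the clipping step is a numerical safeguard against rounding pushing the cosine slightly outside $[-1, 1]$, which the Cauchy-Schwarz inequality rules out in exact arithmetic.

```python
import numpy as np

def angle(x, y):
    """Unsigned angle in [0, pi] between nonzero vectors x and y."""
    cos_xy = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos_xy, -1.0, 1.0))

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
v = np.array([3.0, -4.0, 12.0])

right_angle = angle(e1, e2)      # orthogonal vectors
opposite = angle(v, -v)          # opposite directions
```

Here `right_angle` is $\pi/2$ and `opposite` is $\pi$, up to rounding.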

The standard inner product on $\mathbf{R}^{m \times n}$, the set of $m \times n$ real matrices, is given
by

\[
\langle X, Y \rangle = \mathbf{tr}(X^T Y) = \sum_{i=1}^m \sum_{j=1}^n X_{ij} Y_{ij},
\]

for $X, Y \in \mathbf{R}^{m \times n}$. (Here $\mathbf{tr}$ denotes trace of a matrix, i.e., the sum of its diagonal
elements.) We use the notation $\mathbf{tr}(X^T Y)$ instead of $\langle X, Y \rangle$. Note that the inner
product of two matrices is the inner product of the associated vectors, in $\mathbf{R}^{mn}$,
obtained by listing the coefficients of the matrices in some order, such as row
major.

The Frobenius norm of a matrix $X \in \mathbf{R}^{m \times n}$ is given by

\[
\|X\|_F = \left( \mathbf{tr}(X^T X) \right)^{1/2}
= \left( \sum_{i=1}^m \sum_{j=1}^n X_{ij}^2 \right)^{1/2}. \qquad (A.2)
\]

The Frobenius norm is the Euclidean norm of the vector obtained by listing the
coefficients of the matrix. (The $\ell_2$-norm of a matrix is a different norm; see §A.1.5.)
The standard inner product on $\mathbf{S}^n$, the set of symmetric $n \times n$ matrices, is given
by

\[
\langle X, Y \rangle = \mathbf{tr}(XY) = \sum_{i=1}^n \sum_{j=1}^n X_{ij} Y_{ij}
= \sum_{i=1}^n X_{ii} Y_{ii} + 2 \sum_{i<j} X_{ij} Y_{ij}.
\]
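
The two expressions for the inner product on $\mathbf{S}^n$, and the description of the Frobenius norm as a Euclidean norm of the listed coefficients, can be confirmed numerically. A minimal sketch, assuming NumPy and hypothetical random symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

def random_symmetric(n):
    # Hypothetical test data: a random symmetric n x n matrix.
    B = rng.standard_normal((n, n))
    return (B + B.T) / 2

X, Y = random_symmetric(n), random_symmetric(n)

# Inner product as a trace.
ip_trace = np.trace(X @ Y)

# Inner product as diagonal terms plus twice the strictly upper triangle.
iu = np.triu_indices(n, k=1)
ip_sum = np.sum(np.diag(X) * np.diag(Y)) + 2 * np.sum(X[iu] * Y[iu])

# Frobenius norm from the trace formula.
fro = np.sqrt(np.trace(X.T @ X))
```

The values `ip_trace` and `ip_sum` coincide, and `fro` equals the Euclidean norm of `X.ravel()`, the row-major listing of the coefficients.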


A.1.2 Norms, distance, and unit ball

A function $f : \mathbf{R}^n \to \mathbf{R}$ with $\mathbf{dom}\, f = \mathbf{R}^n$ is called a norm if

• $f$ is nonnegative: $f(x) \geq 0$ for all $x \in \mathbf{R}^n$

• $f$ is definite: $f(x) = 0$ only if $x = 0$

• $f$ is homogeneous: $f(tx) = |t| f(x)$, for all $x \in \mathbf{R}^n$ and $t \in \mathbf{R}$

• $f$ satisfies the triangle inequality: $f(x + y) \leq f(x) + f(y)$, for all $x, y \in \mathbf{R}^n$

We use the notation $f(x) = \|x\|$, which is meant to suggest that a norm is a
generalization of the absolute value on $\mathbf{R}$. When we specify a particular norm,
we use the notation $\|x\|_{\mathrm{symb}}$, where the subscript is a mnemonic to indicate which
norm is meant.

A norm is a measure of the length of a vector $x$; we can measure the distance
between two vectors $x$ and $y$ as the length of their difference, i.e.,

\[
\mathbf{dist}(x, y) = \|x - y\|.
\]

We refer to $\mathbf{dist}(x, y)$ as the distance between $x$ and $y$, in the norm $\| \cdot \|$.

The set of all vectors with norm less than or equal to one,

\[
B = \{ x \in \mathbf{R}^n \mid \|x\| \leq 1 \},
\]

is called the unit ball of the norm $\| \cdot \|$. The unit ball satisfies the following properties:

• $B$ is symmetric about the origin, i.e., $x \in B$ if and only if $-x \in B$

• $B$ is convex

• $B$ is closed, bounded, and has nonempty interior

Conversely, if $C \subseteq \mathbf{R}^n$ is any set satisfying these three conditions, then it is the
unit ball of a norm, which is given by

\[
\|x\| = \left( \sup\{ t \geq 0 \mid tx \in C \} \right)^{-1}.
\]


A.1.3 Examples

The simplest example of a norm is the absolute value on $\mathbf{R}$. Another simple
example is the Euclidean or $\ell_2$-norm on $\mathbf{R}^n$, defined above in (A.1). Two other
frequently used norms on $\mathbf{R}^n$ are the sum-absolute-value, or $\ell_1$-norm, given by

\[
\|x\|_1 = |x_1| + \cdots + |x_n|,
\]

and the Chebyshev or $\ell_\infty$-norm, given by

\[
\|x\|_\infty = \max\{ |x_1|, \ldots, |x_n| \}.
\]

These three norms are part of a family parametrized by a constant traditionally
denoted $p$, with $p \geq 1$: the $\ell_p$-norm is defined by

\[
\|x\|_p = \left( |x_1|^p + \cdots + |x_n|^p \right)^{1/p}.
\]

This yields the $\ell_1$-norm when $p = 1$ and the Euclidean norm when $p = 2$. It is easy
to show that for any $x \in \mathbf{R}^n$,

\[
\lim_{p \to \infty} \|x\|_p = \max\{ |x_1|, \ldots, |x_n| \},
\]

so the $\ell_\infty$-norm also fits in this family, as a limit.
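
The limit can be observed numerically. The sketch below assumes NumPy; the function name `lp_norm` and the test vector are hypothetical choices for the example.

```python
import numpy as np

def lp_norm(x, p):
    """The l_p-norm (|x_1|^p + ... + |x_n|^p)^(1/p), for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0, 1.0])

norm_1 = lp_norm(x, 1)            # sum of absolute values
norm_2 = lp_norm(x, 2)            # Euclidean norm
norm_50 = lp_norm(x, 50)          # already very close to the limit
norm_inf = np.max(np.abs(x))      # Chebyshev norm
```

For this vector `norm_1` is 8, `norm_2` is $\sqrt{26}$, and `norm_50` differs from `norm_inf` (which is 4) by less than $10^{-6}$. For very large $p$ the direct formula can overflow; factoring out the largest absolute entry before exponentiating avoids this.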

Another important family of norms are the quadratic norms. For $P \in \mathbf{S}^n_{++}$, we
define the $P$-quadratic norm as

\[
\|x\|_P = (x^T P x)^{1/2} = \|P^{1/2} x\|_2.
\]

The unit ball of a quadratic norm is an ellipsoid (and conversely, if the unit ball of
a norm is an ellipsoid, the norm is a quadratic norm).

Some common norms on $\mathbf{R}^{m \times n}$ are the Frobenius norm, defined above in (A.2),
the sum-absolute-value norm,

\[
\|X\|_{\mathrm{sav}} = \sum_{i=1}^m \sum_{j=1}^n |X_{ij}|,
\]

and the maximum-absolute-value norm,

\[
\|X\|_{\mathrm{mav}} = \max\{ |X_{ij}| \mid i = 1, \ldots, m, \; j = 1, \ldots, n \}.
\]

We will encounter several other important norms of matrices in §A.1.5.

A.1.4 Equivalence of norms

Suppose that $\| \cdot \|_a$ and $\| \cdot \|_b$ are norms on $\mathbf{R}^n$. A basic result of analysis is that
there exist positive constants $\alpha$ and $\beta$ such that, for all $x \in \mathbf{R}^n$,

\[
\alpha \|x\|_a \leq \|x\|_b \leq \beta \|x\|_a.
\]

This means that the norms are equivalent, i.e., they define the same set of open
subsets, the same set of convergent sequences, and so on (see §A.2). (We conclude
that any norms on any finite-dimensional vector space are equivalent, but on
infinite-dimensional vector spaces, the result need not hold.) Using convex analysis,
we can give a more specific result: If $\| \cdot \|$ is any norm on $\mathbf{R}^n$, then there exists
a quadratic norm $\| \cdot \|_P$ for which

\[
\|x\|_P \leq \|x\| \leq \sqrt{n} \, \|x\|_P
\]

holds for all $x$. In other words, any norm on $\mathbf{R}^n$ can be uniformly approximated,
within a factor of $\sqrt{n}$, by a quadratic norm. (See §8.4.1.)


A.1.5 Operator norms

Suppose $\| \cdot \|_a$ and $\| \cdot \|_b$ are norms on $\mathbf{R}^m$ and $\mathbf{R}^n$, respectively. We define the
operator norm of $X \in \mathbf{R}^{m \times n}$, induced by the norms $\| \cdot \|_a$ and $\| \cdot \|_b$, as

\[
\|X\|_{a,b} = \sup\{ \|Xu\|_a \mid \|u\|_b \leq 1 \}.
\]

(It can be shown that this defines a norm on $\mathbf{R}^{m \times n}$.)

When $\| \cdot \|_a$ and $\| \cdot \|_b$ are both Euclidean norms, the operator norm of $X$ is its
maximum singular value, and is denoted $\|X\|_2$:

\[
\|X\|_2 = \sigma_{\max}(X) = \left( \lambda_{\max}(X^T X) \right)^{1/2}.
\]

(This agrees with the Euclidean norm on $\mathbf{R}^m$, when $X \in \mathbf{R}^{m \times 1}$, so there is no
clash of notation.) This norm is also called the spectral norm or $\ell_2$-norm of $X$.

As another example, the norm induced by the $\ell_\infty$-norm on $\mathbf{R}^m$ and $\mathbf{R}^n$, denoted
$\|X\|_\infty$, is the max-row-sum norm,

\[
\|X\|_\infty = \sup\{ \|Xu\|_\infty \mid \|u\|_\infty \leq 1 \}
= \max_{i=1,\ldots,m} \sum_{j=1}^n |X_{ij}|.
\]

The norm induced by the $\ell_1$-norm on $\mathbf{R}^m$ and $\mathbf{R}^n$, denoted $\|X\|_1$, is the
max-column-sum norm,

\[
\|X\|_1 = \max_{j=1,\ldots,n} \sum_{i=1}^m |X_{ij}|.
\]
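
The three induced norms above can be computed directly from their closed forms and compared with NumPy's built-in matrix norms. This is a small sketch on a hypothetical $2 \times 3$ matrix; the vector `u` is one choice of a point in the $\ell_\infty$ unit ball at which the supremum defining $\|X\|_\infty$ is attained for this particular matrix.

```python
import numpy as np

X = np.array([[1.0, -2.0, 3.0],
              [-4.0, 5.0, -6.0]])

# Spectral norm: maximum singular value, or sqrt of lambda_max(X^T X).
spectral = np.linalg.svd(X, compute_uv=False).max()
spectral_eig = np.sqrt(np.linalg.eigvalsh(X.T @ X).max())

# Max-row-sum norm (induced by the l_inf-norm).
max_row_sum = np.abs(X).sum(axis=1).max()

# Max-column-sum norm (induced by the l_1-norm).
max_col_sum = np.abs(X).sum(axis=0).max()

# Signs of the row with the largest absolute sum attain the supremum.
u = np.sign(X[1])
attained = np.max(np.abs(X @ u))
```

For this matrix the max-row-sum norm is 15 and the max-column-sum norm is 9; `np.linalg.norm(X, 2)`, `np.linalg.norm(X, np.inf)`, and `np.linalg.norm(X, 1)` return the same three quantities.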
A.1.6 Dual norm

Let $\| \cdot \|$ be a norm on $\mathbf{R}^n$. The associated dual norm, denoted $\| \cdot \|_*$, is defined as

\[
\|z\|_* = \sup\{ z^T x \mid \|x\| \leq 1 \}.
\]

(This can be shown to be a norm.) The dual norm can be interpreted as the
operator norm of $z^T$, interpreted as a $1 \times n$ matrix, with the norm $\| \cdot \|$ on $\mathbf{R}^n$, and
the absolute value on $\mathbf{R}$:

\[
\|z\|_* = \sup\{ |z^T x| \mid \|x\| \leq 1 \}.
\]

From the definition of dual norm we have the inequality

\[
z^T x \leq \|x\| \, \|z\|_*,
\]

which holds for all $x$ and $z$. This inequality is tight, in the following sense: for any
$x$ there is a $z$ for which the inequality holds with equality. (Similarly, for any $z$
there is an $x$ that gives equality.) The dual of the dual norm is the original norm:
we have $\|x\|_{**} = \|x\|$ for all $x$. (This need not hold in infinite-dimensional vector
spaces.)

The dual of the Euclidean norm is the Euclidean norm, since

\[
\sup\{ z^T x \mid \|x\|_2 \leq 1 \} = \|z\|_2.
\]

(This follows from the Cauchy-Schwarz inequality; for nonzero $z$, the value of $x$
that maximizes $z^T x$ over $\|x\|_2 \leq 1$ is $z/\|z\|_2$.)

The dual of the $\ell_\infty$-norm is the $\ell_1$-norm:

\[
\sup\{ z^T x \mid \|x\|_\infty \leq 1 \} = \sum_{i=1}^n |z_i| = \|z\|_1,
\]

and the dual of the $\ell_1$-norm is the $\ell_\infty$-norm. More generally, the dual of the $\ell_p$-norm
is the $\ell_q$-norm, where $q$ satisfies $1/p + 1/q = 1$, i.e., $q = p/(p - 1)$.

As another example, consider the $\ell_2$- or spectral norm on $\mathbf{R}^{m \times n}$. The associated
dual norm is

\[
\|Z\|_{2*} = \sup\{ \mathbf{tr}(Z^T X) \mid \|X\|_2 \leq 1 \},
\]

which turns out to be the sum of the singular values,

\[
\|Z\|_{2*} = \sigma_1(Z) + \cdots + \sigma_r(Z) = \mathbf{tr}\left( (Z^T Z)^{1/2} \right),
\]

where $r = \mathbf{rank}\, Z$. This norm is sometimes called the nuclear norm.
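
Each of the dual norms above is a supremum that is attained at an explicit point, which makes the claims easy to check numerically. The sketch below assumes NumPy and hypothetical random data; the maximizers used ($\mathbf{sign}(z)$ for the $\ell_\infty$ ball, $z/\|z\|_2$ for the Euclidean ball, and $UV^T$ from a singular value decomposition $Z = U \,\mathbf{diag}(\sigma)\, V^T$ for the spectral-norm ball) are standard choices rather than constructions given in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(5)

# Dual of the l_inf-norm: x = sign(z) lies in the l_inf unit ball.
x_inf = np.sign(z)
dual_of_inf = z @ x_inf

# Dual of the Euclidean norm: x = z / ||z||_2 lies in the l_2 unit ball.
x_2 = z / np.linalg.norm(z)
dual_of_2 = z @ x_2

# Dual of the spectral norm: X = U V^T has all singular values equal to one.
Z = rng.standard_normal((3, 4))
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
X_opt = U @ Vt
nuclear = np.trace(Z.T @ X_opt)
```

Here `dual_of_inf` equals $\|z\|_1$, `dual_of_2` equals $\|z\|_2$, and `nuclear` equals the sum of the singular values `s`, which is also what `np.linalg.norm(Z, "nuc")` returns.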


A.2 Analysis

A.2.1 Open and closed sets

An element $x \in C \subseteq \mathbf{R}^n$ is called an interior point of $C$ if there exists an $\epsilon > 0$ for
which

\[
\{ y \mid \|y - x\|_2 \leq \epsilon \} \subseteq C,
\]

i.e., there exists a ball centered at $x$ that lies entirely in $C$. The set of all points
interior to $C$ is called the interior of $C$ and is denoted $\mathbf{int}\, C$. (Since all norms
on $\mathbf{R}^n$ are equivalent to the Euclidean norm, all norms generate the same set of
interior points.) A set $C$ is open if $\mathbf{int}\, C = C$, i.e., every point in $C$ is an interior
point. A set $C \subseteq \mathbf{R}^n$ is closed if its complement $\mathbf{R}^n \setminus C = \{ x \in \mathbf{R}^n \mid x \notin C \}$ is
open.

The closure of a set $C$ is defined as

\[
\mathbf{cl}\, C = \mathbf{R}^n \setminus \mathbf{int}(\mathbf{R}^n \setminus C),
\]

i.e., the complement of the interior of the complement of $C$. A point $x$ is in the
closure of $C$ if for every $\epsilon > 0$, there is a $y \in C$ with $\|x - y\|_2 \leq \epsilon$.

We can also describe closed sets and the closure in terms of convergent sequences
and limit points. A set $C$ is closed if and only if it contains the limit point of every
convergent sequence in it. In other words, if $x_1, x_2, \ldots$ converges to $x$, and $x_i \in C$,
then $x \in C$. The closure of $C$ is the set of all limit points of convergent sequences
in $C$.

The boundary of the set $C$ is defined as

\[
\mathbf{bd}\, C = \mathbf{cl}\, C \setminus \mathbf{int}\, C.
\]

A boundary point $x$ (i.e., a point $x \in \mathbf{bd}\, C$) satisfies the following property: For
all $\epsilon > 0$, there exists $y \in C$ and $z \notin C$ with

\[
\|y - x\|_2 \leq \epsilon, \qquad \|z - x\|_2 \leq \epsilon,
\]

i.e., there exist arbitrarily close points in $C$, and also arbitrarily close points not in
$C$. We can characterize closed and open sets in terms of the boundary operation:
$C$ is closed if it contains its boundary, i.e., $\mathbf{bd}\, C \subseteq C$. It is open if it contains no
boundary points, i.e., $C \cap \mathbf{bd}\, C = \emptyset$.

A.2.2 Supremum and infimum

Suppose $C \subseteq \mathbf{R}$. A number $a$ is an upper bound on $C$ if for each $x \in C$, $x \leq a$.
The set of upper bounds on a set $C$ is either empty (in which case we say $C$ is
unbounded above), all of $\mathbf{R}$ (only when $C = \emptyset$), or a closed infinite interval $[b, \infty)$.
The number $b$ is called the least upper bound or supremum of the set $C$, and is
denoted $\sup C$. We take $\sup \emptyset = -\infty$, and $\sup C = \infty$ if $C$ is unbounded above.
When $\sup C \in C$, we say the supremum of $C$ is attained or achieved.

When the set $C$ is finite, $\sup C$ is the maximum of its elements. Some authors
use the notation $\max C$ to denote supremum, when it is attained, but we follow
standard mathematical convention, using $\max C$ only when the set $C$ is finite.

We define lower bound, and infimum, in a similar way. A number $a$ is a lower
bound on $C \subseteq \mathbf{R}$ if for each $x \in C$, $a \leq x$. The infimum (or greatest lower bound)
of a set $C \subseteq \mathbf{R}$ is defined as $\inf C = -\sup(-C)$. When $C$ is finite, the infimum
is the minimum of its elements. We take $\inf \emptyset = \infty$, and $\inf C = -\infty$ if $C$ is
unbounded below, i.e., has no lower bound.

A.3 Functions

A.3.1 Function notation

Our notation for functions is mostly standard, with one exception. When we write

\[
f : A \to B
\]

we mean that $f$ is a function on the set $\mathbf{dom}\, f \subseteq A$ into the set $B$; in particular
we can have $\mathbf{dom}\, f$ a proper subset of the set $A$. Thus the notation $f : \mathbf{R}^n \to \mathbf{R}^m$
means that $f$ maps (some) $n$-vectors into $m$-vectors; it does not mean that $f(x)$
is defined for every $x \in \mathbf{R}^n$. This convention is similar to function declarations in
computer languages. Specifying the data types of the input and output arguments
of a function gives the syntax of that function; it does not guarantee that any input
argument with the specified data type is valid.

As an example consider the function $f : \mathbf{S}^n \to \mathbf{R}$, given by

\[
f(X) = \log \det X, \qquad (A.3)
\]

with $\mathbf{dom}\, f = \mathbf{S}^n_{++}$. The notation $f : \mathbf{S}^n \to \mathbf{R}$ specifies the syntax of $f$: it takes
as argument a symmetric $n \times n$ matrix, and returns a real number. The notation
$\mathbf{dom}\, f = \mathbf{S}^n_{++}$ specifies which symmetric $n \times n$ matrices are valid input arguments
for $f$ (i.e., only positive definite ones). The formula (A.3) specifies what $f(X)$ is,
for $X \in \mathbf{dom}\, f$.

A.3.2 Continuity

A function $f : \mathbf{R}^n \to \mathbf{R}^m$ is continuous at $x \in \mathbf{dom}\, f$ if for all $\epsilon > 0$ there exists a
$\delta > 0$ such that

\[
y \in \mathbf{dom}\, f, \quad \|y - x\|_2 \leq \delta \quad \Longrightarrow \quad \|f(y) - f(x)\|_2 \leq \epsilon.
\]

Continuity can be described in terms of limits: whenever the sequence $x_1, x_2, \ldots$
in $\mathbf{dom}\, f$ converges to a point $x \in \mathbf{dom}\, f$, the sequence $f(x_1), f(x_2), \ldots$ converges
to $f(x)$, i.e.,

\[
\lim_{i \to \infty} f(x_i) = f\left( \lim_{i \to \infty} x_i \right).
\]

A function $f$ is continuous if it is continuous at every point in its domain.

A.3.3 Closed functions

A function $f : \mathbf{R}^n \to \mathbf{R}$ is said to be closed if, for each $\alpha \in \mathbf{R}$, the sublevel set

\[
\{ x \in \mathbf{dom}\, f \mid f(x) \leq \alpha \}
\]

is closed. This is equivalent to the condition that the epigraph of $f$,

\[
\mathbf{epi}\, f = \{ (x, t) \in \mathbf{R}^{n+1} \mid x \in \mathbf{dom}\, f, \; f(x) \leq t \},
\]

is closed. (This definition is general, but is usually only applied to convex functions.)

If $f : \mathbf{R}^n \to \mathbf{R}$ is continuous, and $\mathbf{dom}\, f$ is closed, then $f$ is closed. If $f : \mathbf{R}^n \to \mathbf{R}$
is continuous, with $\mathbf{dom}\, f$ open, then $f$ is closed if and only if $f$ converges to $\infty$
along every sequence converging to a boundary point of $\mathbf{dom}\, f$. In other words, if
$\lim_{i \to \infty} x_i = x \in \mathbf{bd}\, \mathbf{dom}\, f$, with $x_i \in \mathbf{dom}\, f$, we have $\lim_{i \to \infty} f(x_i) = \infty$.


Example A.1 Examples on $\mathbf{R}$.

• The function $f : \mathbf{R} \to \mathbf{R}$, with $f(x) = x \log x$, $\mathbf{dom}\, f = \mathbf{R}_{++}$, is not closed.

• The function $f : \mathbf{R} \to \mathbf{R}$, with

\[
f(x) = \left\{ \begin{array}{ll} x \log x & x > 0 \\ 0 & x = 0, \end{array} \right.
\qquad \mathbf{dom}\, f = \mathbf{R}_+,
\]

is closed.

• The function $f(x) = -\log x$, $\mathbf{dom}\, f = \mathbf{R}_{++}$, is closed.


A. 4  Derivatives 


A. 4.1  Derivative  and  gradient 


Suppose  /  :  R"  Rm  and  x  £  intdom/.  The  function  /  is  differentiable  at  x  if 
there  exists  a  matrix  Df( x)  £  Rmxn  that  satisfies 


lim 

z£dom  /,  z- 


\\fiz)  ~  fix)  ~  Df(x)(z  —  x)\\z 


=  0, 


z-x  2 


(A.4) 


in  which  case  we  refer  to  Df(x)  as  the  derivative  (or  Jacobian)  of  /  at  x.  (There 
can  be  at  most  one  matrix  that  satisfies  (A.4).)  The  function  /  is  differentiable  if 
dom  /  is  open,  and  it  is  differentiable  at  every  point  in  its  domain. 

The affine function of $z$ given by

\[
f(x) + Df(x)(z - x)
\]

is called the first-order approximation of $f$ at (or near) $x$. Evidently this function agrees with $f$ at $z = x$; when $z$ is close to $x$, this affine function is very close to $f$.

The derivative can be found by deriving the first-order approximation of the function $f$ at $x$ (i.e., the matrix $Df(x)$ that satisfies (A.4)), or from partial derivatives:

\[
Df(x)_{ij} = \frac{\partial f_i(x)}{\partial x_j}, \quad i = 1, \ldots, m, \quad j = 1, \ldots, n.
\]
Gradient

When $f$ is real-valued (i.e., $f : \mathbf{R}^n \to \mathbf{R}$) the derivative $Df(x)$ is a $1 \times n$ matrix, i.e., it is a row vector. Its transpose is called the gradient of the function:

\[
\nabla f(x) = Df(x)^T,
\]

which is a (column) vector, i.e., in $\mathbf{R}^n$. Its components are the partial derivatives of $f$:

\[
\nabla f(x)_i = \frac{\partial f(x)}{\partial x_i}, \quad i = 1, \ldots, n.
\]

The first-order approximation of $f$ at a point $x \in \operatorname{int} \operatorname{dom} f$ can be expressed as (the affine function of $z$)

\[
f(x) + \nabla f(x)^T (z - x).
\]

Examples 

As a simple example consider the quadratic function $f : \mathbf{R}^n \to \mathbf{R}$,

\[
f(x) = (1/2) x^T P x + q^T x + r,
\]

where $P \in \mathbf{S}^n$, $q \in \mathbf{R}^n$, and $r \in \mathbf{R}$. Its derivative at $x$ is the row vector $Df(x) = x^T P + q^T$, and its gradient is

\[
\nabla f(x) = Px + q.
\]
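The gradient formula for the quadratic can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated data; the names `numerical_grad` and `grad_error` are illustrative and not part of the text. It compares $Px + q$ with a central finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
P = (M + M.T) / 2              # symmetric, not necessarily definite
q = rng.standard_normal(n)
r = 0.7

def f(x):
    return 0.5 * x @ P @ x + q @ x + r

def grad_f(x):
    # closed-form gradient from the text
    return P @ x + q

def numerical_grad(func, x, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (func(x + e) - func(x - e)) / (2 * eps)
    return g

x0 = rng.standard_normal(n)
grad_error = np.max(np.abs(grad_f(x0) - numerical_grad(f, x0)))
```

For a quadratic the central difference is exact up to rounding, so `grad_error` should be tiny.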

As a more interesting example, we consider the function $f : \mathbf{S}^n \to \mathbf{R}$, given by

\[
f(X) = \log \det X, \qquad \operatorname{dom} f = \mathbf{S}^n_{++}.
\]

One (tedious) way to find the gradient of $f$ is to introduce a basis for $\mathbf{S}^n$, find the gradient of the associated function, and finally translate the result back to $\mathbf{S}^n$. Instead, we will directly find the first-order approximation of $f$ at $X \in \mathbf{S}^n_{++}$. Let $Z \in \mathbf{S}^n_{++}$ be close to $X$, and let $\Delta X = Z - X$ (which is assumed to be small). We have

\[
\begin{aligned}
\log \det Z &= \log \det (X + \Delta X) \\
&= \log \det \bigl( X^{1/2} ( I + X^{-1/2} \Delta X X^{-1/2} ) X^{1/2} \bigr) \\
&= \log \det X + \log \det ( I + X^{-1/2} \Delta X X^{-1/2} ) \\
&= \log \det X + \sum_{i=1}^n \log (1 + \lambda_i),
\end{aligned}
\]

where $\lambda_i$ is the $i$th eigenvalue of $X^{-1/2} \Delta X X^{-1/2}$. Now we use the fact that $\Delta X$ is small, which implies $\lambda_i$ are small, so to first order we have $\log(1 + \lambda_i) \approx \lambda_i$. Using this first-order approximation in the expression above, we get

\[
\begin{aligned}
\log \det Z &\approx \log \det X + \sum_{i=1}^n \lambda_i \\
&= \log \det X + \operatorname{tr} ( X^{-1/2} \Delta X X^{-1/2} ) \\
&= \log \det X + \operatorname{tr} ( X^{-1} \Delta X ) \\
&= \log \det X + \operatorname{tr} \bigl( X^{-1} (Z - X) \bigr),
\end{aligned}
\]

where we have used the fact that the sum of the eigenvalues is the trace, and the property $\operatorname{tr}(AB) = \operatorname{tr}(BA)$.

Thus, the first-order approximation of $f$ at $X$ is the affine function of $Z$ given by

\[
f(Z) \approx f(X) + \operatorname{tr} \bigl( X^{-1} (Z - X) \bigr).
\]

Noting that the second term on the righthand side is the standard inner product of $X^{-1}$ and $Z - X$, we can identify $X^{-1}$ as the gradient of $f$ at $X$. Thus, we can write the simple formula

\[
\nabla f(X) = X^{-1}.
\]

This result should not be surprising, since the derivative of $\log x$, on $\mathbf{R}_{++}$, is $1/x$.
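As a sanity check of the first-order approximation of $\log \det$, the sketch below (assuming NumPy; the matrix sizes, random seed, and perturbation scale are arbitrary choices) compares $\log\det Z$ with $\log\det X + \operatorname{tr}(X^{-1}(Z - X))$ for a small symmetric perturbation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
G = rng.standard_normal((n, n))
X = G @ G.T + n * np.eye(n)        # a positive definite matrix

def logdet(M):
    # slogdet avoids overflow/underflow in the determinant
    sign, val = np.linalg.slogdet(M)
    return val

D = rng.standard_normal((n, n))
D = 1e-4 * (D + D.T) / 2           # small symmetric perturbation (Delta X)
Z = X + D

# first-order approximation: log det X + tr(X^{-1} (Z - X))
first_order = logdet(X) + np.trace(np.linalg.solve(X, Z - X))
logdet_error = abs(logdet(Z) - first_order)
```

The error should scale with the square of the perturbation size, consistent with the approximation being first order.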


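A.4.2 Chain rule

Suppose $f : \mathbf{R}^n \to \mathbf{R}^m$ is differentiable at $x \in \operatorname{int} \operatorname{dom} f$ and $g : \mathbf{R}^m \to \mathbf{R}^p$ is differentiable at $f(x) \in \operatorname{int} \operatorname{dom} g$. Define the composition $h : \mathbf{R}^n \to \mathbf{R}^p$ by $h(z) = g(f(z))$. Then $h$ is differentiable at $x$, with derivative

\[
Dh(x) = Dg(f(x)) Df(x). \tag{A.5}
\]

As an example, suppose $f : \mathbf{R}^n \to \mathbf{R}$, $g : \mathbf{R} \to \mathbf{R}$, and $h(x) = g(f(x))$. Taking the transpose of $Dh(x) = Dg(f(x)) Df(x)$ yields

\[
\nabla h(x) = g'(f(x)) \nabla f(x). \tag{A.6}
\]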

Composition with affine function

Suppose $f : \mathbf{R}^n \to \mathbf{R}^m$ is differentiable, $A \in \mathbf{R}^{n \times p}$, and $b \in \mathbf{R}^n$. Define $g : \mathbf{R}^p \to \mathbf{R}^m$ as $g(x) = f(Ax + b)$, with $\operatorname{dom} g = \{ x \mid Ax + b \in \operatorname{dom} f \}$. The derivative of $g$ is, by the chain rule (A.5), $Dg(x) = Df(Ax + b) A$.

When $f$ is real-valued (i.e., $m = 1$), we obtain the formula for the gradient of a composition of a function with an affine function,

\[
\nabla g(x) = A^T \nabla f(Ax + b).
\]

For example, suppose that $f : \mathbf{R}^n \to \mathbf{R}$, $x, v \in \mathbf{R}^n$, and we define the function $\tilde f : \mathbf{R} \to \mathbf{R}$ by $\tilde f(t) = f(x + tv)$. (Roughly speaking, $\tilde f$ is $f$, restricted to the line $\{ x + tv \mid t \in \mathbf{R} \}$.) Then we have

\[
D \tilde f(t) = \tilde f'(t) = \nabla f(x + tv)^T v.
\]

(The scalar $\tilde f'(0)$ is the directional derivative of $f$, at $x$, in the direction $v$.)


Example A.2 Consider the function $f : \mathbf{R}^n \to \mathbf{R}$, with $\operatorname{dom} f = \mathbf{R}^n$ and

\[
f(x) = \log \sum_{i=1}^m \exp (a_i^T x + b_i),
\]

where $a_1, \ldots, a_m \in \mathbf{R}^n$, and $b_1, \ldots, b_m \in \mathbf{R}$. We can find a simple expression for its gradient by noting that it is the composition of the affine function $Ax + b$, where $A \in \mathbf{R}^{m \times n}$ with rows $a_1^T, \ldots, a_m^T$, and the function $g : \mathbf{R}^m \to \mathbf{R}$ given by $g(y) = \log ( \sum_{i=1}^m \exp y_i )$. Simple differentiation (or the formula (A.6)) shows that

\[
\nabla g(y) = \frac{1}{\sum_{i=1}^m \exp y_i}
\begin{bmatrix} \exp y_1 \\ \vdots \\ \exp y_m \end{bmatrix},
\tag{A.7}
\]

so by the composition formula we have

\[
\nabla f(x) = \frac{1}{\mathbf{1}^T z} A^T z,
\]

where $z_i = \exp(a_i^T x + b_i)$, $i = 1, \ldots, m$.
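The gradient formula of example A.2 can be sketched in code as follows, assuming NumPy and random problem data (the names `lse_grad_error` and `fd` are illustrative). Subtracting the largest exponent before exponentiating is a numerical-stability choice that is not in the text; it rescales $z$ by a positive constant, which leaves $A^T z / \mathbf{1}^T z$ unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 3
A = rng.standard_normal((m, n))    # rows are a_i^T
b = rng.standard_normal(m)

def f(x):
    # log-sum-exp of the affine function Ax + b, computed stably
    y = A @ x + b
    ymax = y.max()
    return ymax + np.log(np.sum(np.exp(y - ymax)))

def grad_f(x):
    # gradient (1 / 1^T z) A^T z, with z scaled by exp(-max y)
    y = A @ x + b
    z = np.exp(y - y.max())
    return A.T @ z / z.sum()

x0 = rng.standard_normal(n)
eps = 1e-6
fd = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(n)])
lse_grad_error = np.max(np.abs(grad_f(x0) - fd))
```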

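Example A.3 We derive an expression for $\nabla f(x)$, where

\[
f(x) = \log \det (F_0 + x_1 F_1 + \cdots + x_n F_n),
\]

where $F_0, \ldots, F_n \in \mathbf{S}^p$, and

\[
\operatorname{dom} f = \{ x \in \mathbf{R}^n \mid F_0 + x_1 F_1 + \cdots + x_n F_n \succ 0 \}.
\]

The function $f$ is the composition of the affine mapping from $x \in \mathbf{R}^n$ to $F_0 + x_1 F_1 + \cdots + x_n F_n \in \mathbf{S}^p$, with the function $\log \det X$. We use the chain rule to evaluate

\[
\frac{\partial f(x)}{\partial x_i} = \operatorname{tr} ( F_i \nabla \log \det (F) ) = \operatorname{tr} ( F^{-1} F_i ),
\]

where $F = F_0 + x_1 F_1 + \cdots + x_n F_n$. Thus we have

\[
\nabla f(x) = \begin{bmatrix} \operatorname{tr}(F^{-1} F_1) \\ \vdots \\ \operatorname{tr}(F^{-1} F_n) \end{bmatrix}.
\]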


A.4.3 Second derivative

In this section we review the second derivative of a real-valued function $f : \mathbf{R}^n \to \mathbf{R}$. The second derivative or Hessian matrix of $f$ at $x \in \operatorname{int} \operatorname{dom} f$, denoted $\nabla^2 f(x)$, is given by

\[
\nabla^2 f(x)_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}, \quad i = 1, \ldots, n, \quad j = 1, \ldots, n,
\]

provided $f$ is twice differentiable at $x$, where the partial derivatives are evaluated at $x$. The second-order approximation of $f$, at or near $x$, is the quadratic function of $z$ defined by

\[
\hat f(z) = f(x) + \nabla f(x)^T (z - x) + (1/2) (z - x)^T \nabla^2 f(x) (z - x).
\]

This second-order approximation satisfies

\[
\lim_{z \in \operatorname{dom} f,\ z \neq x,\ z \to x}
\frac{| f(z) - \hat f(z) |}{\| z - x \|_2^2} = 0.
\]

Not surprisingly, the second derivative can be interpreted as the derivative of the first derivative. If $f$ is differentiable, the gradient mapping is the function $\nabla f : \mathbf{R}^n \to \mathbf{R}^n$, with $\operatorname{dom} \nabla f = \operatorname{dom} f$, with value $\nabla f(x)$ at $x$. The derivative of this mapping is

\[
D \nabla f(x) = \nabla^2 f(x).
\]


Examples 

As a simple example consider the quadratic function $f : \mathbf{R}^n \to \mathbf{R}$,

\[
f(x) = (1/2) x^T P x + q^T x + r,
\]

where $P \in \mathbf{S}^n$, $q \in \mathbf{R}^n$, and $r \in \mathbf{R}$. Its gradient is $\nabla f(x) = Px + q$, so its Hessian is given by $\nabla^2 f(x) = P$. The second-order approximation of a quadratic function is itself.

As a more complicated example, we consider again the function $f : \mathbf{S}^n \to \mathbf{R}$, given by $f(X) = \log \det X$, with $\operatorname{dom} f = \mathbf{S}^n_{++}$. To find the second-order approximation (and therefore, the Hessian), we will derive a first-order approximation of the gradient, $\nabla f(X) = X^{-1}$. For $Z \in \mathbf{S}^n_{++}$ near $X \in \mathbf{S}^n_{++}$, and $\Delta X = Z - X$, we have

\[
\begin{aligned}
Z^{-1} &= (X + \Delta X)^{-1} \\
&= \bigl( X^{1/2} ( I + X^{-1/2} \Delta X X^{-1/2} ) X^{1/2} \bigr)^{-1} \\
&= X^{-1/2} ( I + X^{-1/2} \Delta X X^{-1/2} )^{-1} X^{-1/2} \\
&\approx X^{-1/2} ( I - X^{-1/2} \Delta X X^{-1/2} ) X^{-1/2} \\
&= X^{-1} - X^{-1} \Delta X X^{-1},
\end{aligned}
\]

using the first-order approximation $(I + A)^{-1} \approx I - A$, valid for $A$ small.

This approximation is enough for us to identify the Hessian of $f$ at $X$. The Hessian is a quadratic form on $\mathbf{S}^n$. Such a quadratic form is cumbersome to describe in the general case, since it requires four indices. But from the first-order approximation of the gradient above, the quadratic form can be expressed as

\[
- \operatorname{tr} ( X^{-1} U X^{-1} V ),
\]

where $U, V \in \mathbf{S}^n$ are the arguments of the quadratic form. (This generalizes the expression for the scalar case: $(\log x)'' = -1/x^2$.)

Now we have the second-order approximation of $f$ near $X$:

\[
\begin{aligned}
f(Z) &= f(X + \Delta X) \\
&\approx f(X) + \operatorname{tr} ( X^{-1} \Delta X ) - (1/2) \operatorname{tr} ( X^{-1} \Delta X X^{-1} \Delta X ) \\
&\approx f(X) + \operatorname{tr} \bigl( X^{-1} (Z - X) \bigr) - (1/2) \operatorname{tr} \bigl( X^{-1} (Z - X) X^{-1} (Z - X) \bigr).
\end{aligned}
\]
A.4.4 Chain rule for second derivative

A general chain rule for the second derivative is cumbersome in most cases, so we will state it only for some special cases that we will need.

Composition with scalar function

Suppose $f : \mathbf{R}^n \to \mathbf{R}$, $g : \mathbf{R} \to \mathbf{R}$, and $h(x) = g(f(x))$. Simply working out the partial derivatives yields

\[
\nabla^2 h(x) = g'(f(x)) \nabla^2 f(x) + g''(f(x)) \nabla f(x) \nabla f(x)^T. \tag{A.8}
\]

Composition with affine function

Suppose $f : \mathbf{R}^n \to \mathbf{R}$, $A \in \mathbf{R}^{n \times m}$, and $b \in \mathbf{R}^n$. Define $g : \mathbf{R}^m \to \mathbf{R}$ by $g(x) = f(Ax + b)$. Then we have

\[
\nabla^2 g(x) = A^T \nabla^2 f(Ax + b) A.
\]

As an example, consider the restriction of a real-valued function $f$ to a line, i.e., the function $\tilde f(t) = f(x + tv)$, where $x$ and $v$ are fixed. Then we have

\[
\nabla^2 \tilde f(t) = \tilde f''(t) = v^T \nabla^2 f(x + tv) v.
\]

Example A.4 We consider the function $f : \mathbf{R}^n \to \mathbf{R}$ from example A.2,

\[
f(x) = \log \sum_{i=1}^m \exp (a_i^T x + b_i),
\]

where $a_1, \ldots, a_m \in \mathbf{R}^n$, and $b_1, \ldots, b_m \in \mathbf{R}$. By noting that $f(x) = g(Ax + b)$, where $g(y) = \log ( \sum_{i=1}^m \exp y_i )$, we can obtain a simple formula for the Hessian of $f$. Taking partial derivatives, or using the formula (A.8), noting that $g$ is the composition of $\log$ with $\sum_{i=1}^m \exp y_i$, yields

\[
\nabla^2 g(y) = \operatorname{diag} ( \nabla g(y) ) - \nabla g(y) \nabla g(y)^T,
\]

where $\nabla g(y)$ is given in (A.7). By the composition formula we have

\[
\nabla^2 f(x) = A^T \left( \frac{1}{\mathbf{1}^T z} \operatorname{diag}(z) - \frac{1}{(\mathbf{1}^T z)^2} z z^T \right) A,
\]

where $z_i = \exp(a_i^T x + b_i)$, $i = 1, \ldots, m$.
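The Hessian formula of example A.4 can be verified against finite differences of the gradient. The following is a sketch under the assumption of NumPy and random data; `hess_error` and `min_eig` are illustrative names. It also checks that the Hessian has no significantly negative eigenvalue at the test point.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def grad_f(x):
    z = np.exp(A @ x + b)
    return A.T @ z / z.sum()

def hess_f(x):
    # A^T ( diag(z)/(1^T z) - z z^T/(1^T z)^2 ) A
    z = np.exp(A @ x + b)
    s = z.sum()
    return A.T @ (np.diag(z) / s - np.outer(z, z) / s**2) @ A

x0 = rng.standard_normal(n)
eps = 1e-6
# column j approximates the derivative of the gradient with respect to x_j
H_fd = np.column_stack(
    [(grad_f(x0 + eps * e) - grad_f(x0 - eps * e)) / (2 * eps) for e in np.eye(n)]
)
hess_error = np.max(np.abs(hess_f(x0) - H_fd))
min_eig = np.linalg.eigvalsh(hess_f(x0)).min()
```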

A.5 Linear algebra

A.5.1 Range and nullspace

Let $A \in \mathbf{R}^{m \times n}$ (i.e., $A$ is a real matrix with $m$ rows and $n$ columns). The range of $A$, denoted $\mathcal{R}(A)$, is the set of all vectors in $\mathbf{R}^m$ that can be written as linear combinations of the columns of $A$, i.e.,

\[
\mathcal{R}(A) = \{ Ax \mid x \in \mathbf{R}^n \}.
\]

The range $\mathcal{R}(A)$ is a subspace of $\mathbf{R}^m$, i.e., it is itself a vector space. Its dimension is the rank of $A$, denoted $\operatorname{rank} A$. The rank of $A$ can never be greater than the minimum of $m$ and $n$. We say $A$ has full rank if $\operatorname{rank} A = \min\{m, n\}$.

The nullspace (or kernel) of $A$, denoted $\mathcal{N}(A)$, is the set of all vectors $x$ mapped into zero by $A$:

\[
\mathcal{N}(A) = \{ x \mid Ax = 0 \}.
\]

The nullspace is a subspace of $\mathbf{R}^n$.


Orthogonal  decomposition  induced  by  A 

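If $\mathcal{V}$ is a subspace of $\mathbf{R}^n$, its orthogonal complement, denoted $\mathcal{V}^\perp$, is defined as

\[
\mathcal{V}^\perp = \{ x \mid z^T x = 0 \text{ for all } z \in \mathcal{V} \}.
\]

(As one would expect of a complement, we have $\mathcal{V}^{\perp\perp} = \mathcal{V}$.)

A basic result of linear algebra is that, for any $A \in \mathbf{R}^{m \times n}$, we have

\[
\mathcal{N}(A) = \mathcal{R}(A^T)^\perp.
\]

(Applying the result to $A^T$ we also have $\mathcal{R}(A) = \mathcal{N}(A^T)^\perp$.) This result is often stated as

\[
\mathcal{N}(A) \overset{\perp}{\oplus} \mathcal{R}(A^T) = \mathbf{R}^n. \tag{A.9}
\]

Here the symbol $\overset{\perp}{\oplus}$ refers to orthogonal direct sum, i.e., the sum of two subspaces that are orthogonal. The decomposition (A.9) of $\mathbf{R}^n$ is called the orthogonal decomposition induced by $A$.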
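The orthogonal decomposition (A.9) can be illustrated numerically. The sketch below is one possible way to do it, assuming NumPy and using a full singular value decomposition to obtain orthonormal bases; the names `R_basis` and `N_basis` and the rank threshold are choices made here, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 3, 5, 2
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank 2 by construction

U, s, Vt = np.linalg.svd(A)               # full SVD: Vt is n x n
rank = int(np.sum(s > 1e-10 * s[0]))      # numerical rank (threshold is an assumption)
R_basis = Vt[:rank].T                     # orthonormal basis for the range of A^T
N_basis = Vt[rank:].T                     # orthonormal basis for the nullspace of A

# split an arbitrary vector into its two orthogonal components
x = rng.standard_normal(n)
x_R = R_basis @ (R_basis.T @ x)
x_N = N_basis @ (N_basis.T @ x)
```

Every $x \in \mathbf{R}^n$ is recovered as `x_R + x_N`, with `A @ x_N` equal to zero up to rounding.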


A.5.2 Symmetric eigenvalue decomposition

Suppose $A \in \mathbf{S}^n$, i.e., $A$ is a real symmetric $n \times n$ matrix. Then $A$ can be factored as

\[
A = Q \Lambda Q^T, \tag{A.10}
\]

where $Q \in \mathbf{R}^{n \times n}$ is orthogonal, i.e., satisfies $Q^T Q = I$, and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$. The (real) numbers $\lambda_i$ are the eigenvalues of $A$, and are the roots of the characteristic polynomial $\det(sI - A)$. The columns of $Q$ form an orthonormal set of eigenvectors of $A$. The factorization (A.10) is called the spectral decomposition or (symmetric) eigenvalue decomposition of $A$.

We order the eigenvalues as $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$. We use the notation $\lambda_i(A)$ to refer to the $i$th largest eigenvalue of $A \in \mathbf{S}^n$. We usually write the largest or maximum eigenvalue as $\lambda_1(A) = \lambda_{\max}(A)$, and the least or minimum eigenvalue as $\lambda_n(A) = \lambda_{\min}(A)$.
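The determinant and trace can be expressed in terms of the eigenvalues,

\[
\det A = \prod_{i=1}^n \lambda_i, \qquad \operatorname{tr} A = \sum_{i=1}^n \lambda_i,
\]

as can the spectral and Frobenius norms,

\[
\| A \|_2 = \max_{i=1,\ldots,n} | \lambda_i | = \max \{ \lambda_1, -\lambda_n \}, \qquad
\| A \|_F = \left( \sum_{i=1}^n \lambda_i^2 \right)^{1/2}.
\]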

Definiteness  and  matrix  inequalities 

The largest and smallest eigenvalues satisfy

\[
\lambda_{\max}(A) = \sup_{x \neq 0} \frac{x^T A x}{x^T x}, \qquad
\lambda_{\min}(A) = \inf_{x \neq 0} \frac{x^T A x}{x^T x}.
\]

In particular, for any $x$, we have

\[
\lambda_{\min}(A) x^T x \leq x^T A x \leq \lambda_{\max}(A) x^T x,
\]

with both inequalities tight for (different) choices of $x$.

A matrix $A \in \mathbf{S}^n$ is called positive definite if for all $x \neq 0$, $x^T A x > 0$. We denote this as $A \succ 0$. By the inequality above, we see that $A \succ 0$ if and only if all its eigenvalues are positive, i.e., $\lambda_{\min}(A) > 0$. If $-A$ is positive definite, we say $A$ is negative definite, which we write as $A \prec 0$. We use $\mathbf{S}^n_{++}$ to denote the set of positive definite matrices in $\mathbf{S}^n$.

If $A$ satisfies $x^T A x \geq 0$ for all $x$, we say that $A$ is positive semidefinite or nonnegative definite. If $-A$ is nonnegative definite, i.e., if $x^T A x \leq 0$ for all $x$, we say that $A$ is negative semidefinite or nonpositive definite. We use $\mathbf{S}^n_+$ to denote the set of nonnegative definite matrices in $\mathbf{S}^n$.

For $A, B \in \mathbf{S}^n$, we use $A \prec B$ to mean $B - A \succ 0$, and so on. These inequalities are called matrix inequalities, or generalized inequalities associated with the positive semidefinite cone.


Symmetric  squareroot 

Let $A \in \mathbf{S}^n_+$, with eigenvalue decomposition $A = Q \operatorname{diag}(\lambda_1, \ldots, \lambda_n) Q^T$. We define the (symmetric) squareroot of $A$ as

\[
A^{1/2} = Q \operatorname{diag}(\lambda_1^{1/2}, \ldots, \lambda_n^{1/2}) Q^T.
\]

The squareroot $A^{1/2}$ is the unique symmetric positive semidefinite solution of the equation $X^2 = A$.
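The definition of the symmetric squareroot translates directly into code. This is a minimal sketch assuming NumPy; clipping slightly negative eigenvalues to zero is a rounding safeguard added here and is not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
G = rng.standard_normal((n, 2))
A = G @ G.T                            # positive semidefinite, rank 2 (singular)

lam, Q = np.linalg.eigh(A)             # A = Q diag(lam) Q^T
lam = np.clip(lam, 0.0, None)          # remove tiny negative values caused by rounding
A_half = Q @ np.diag(np.sqrt(lam)) @ Q.T
```

The result `A_half` is symmetric, positive semidefinite, and satisfies `A_half @ A_half` approximately equal to `A`, even though `A` is singular in this instance.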


A.5.3 Generalized eigenvalue decomposition

The generalized eigenvalues of a pair of symmetric matrices $(A, B) \in \mathbf{S}^n \times \mathbf{S}^n$ are defined as the roots of the polynomial $\det(sB - A)$.

We are usually interested in matrix pairs with $B \in \mathbf{S}^n_{++}$. In this case the generalized eigenvalues are also the eigenvalues of $B^{-1/2} A B^{-1/2}$ (which are real). As with the standard eigenvalue decomposition, we order the generalized eigenvalues in nonincreasing order, as $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, and denote the maximum generalized eigenvalue by $\lambda_{\max}(A, B)$.

When $B \in \mathbf{S}^n_{++}$, the pair of matrices can be factored as

\[
A = V \Lambda V^T, \qquad B = V V^T, \tag{A.11}
\]

where $V \in \mathbf{R}^{n \times n}$ is nonsingular, and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$, where $\lambda_i$ are the generalized eigenvalues of the pair $(A, B)$. The decomposition (A.11) is called the generalized eigenvalue decomposition.

The generalized eigenvalue decomposition is related to the standard eigenvalue decomposition of the matrix $B^{-1/2} A B^{-1/2}$. If $Q \Lambda Q^T$ is the eigenvalue decomposition of $B^{-1/2} A B^{-1/2}$, then (A.11) holds with $V = B^{1/2} Q$.
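The construction $V = B^{1/2} Q$ can be followed step by step in code. The sketch below assumes NumPy and random test matrices; the variable names are illustrative. It builds $B^{1/2}$ and $B^{-1/2}$ from the eigenvalue decomposition of $B$, then forms $V$ and $\Lambda$ as described above.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                      # symmetric
G = rng.standard_normal((n, n))
B = G @ G.T + n * np.eye(n)            # positive definite

mu, W = np.linalg.eigh(B)
B_half = W @ np.diag(np.sqrt(mu)) @ W.T
B_inv_half = W @ np.diag(1 / np.sqrt(mu)) @ W.T

# eigenvalue decomposition of B^{-1/2} A B^{-1/2}
lam, Q = np.linalg.eigh(B_inv_half @ A @ B_inv_half)
lam, Q = lam[::-1], Q[:, ::-1]         # nonincreasing order, as in the text
V = B_half @ Q
```

With these quantities, `V @ np.diag(lam) @ V.T` reproduces `A` and `V @ V.T` reproduces `B`, up to rounding.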

A.5.4 Singular value decomposition

Suppose $A \in \mathbf{R}^{m \times n}$ with $\operatorname{rank} A = r$. Then $A$ can be factored as

\[
A = U \Sigma V^T, \tag{A.12}
\]

where $U \in \mathbf{R}^{m \times r}$ satisfies $U^T U = I$, $V \in \mathbf{R}^{n \times r}$ satisfies $V^T V = I$, and $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r)$, with

\[
\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0.
\]

The factorization (A.12) is called the singular value decomposition (SVD) of $A$. The columns of $U$ are called left singular vectors of $A$, the columns of $V$ are right singular vectors, and the numbers $\sigma_i$ are the singular values. The singular value decomposition can be written

\[
A = \sum_{i=1}^r \sigma_i u_i v_i^T,
\]

where $u_i \in \mathbf{R}^m$ are the left singular vectors, and $v_i \in \mathbf{R}^n$ are the right singular vectors.

The singular value decomposition of a matrix $A$ is closely related to the eigenvalue decomposition of the (symmetric, nonnegative definite) matrix $A^T A$. Using (A.12) we can write

\[
A^T A = V \Sigma^2 V^T =
\begin{bmatrix} V & \tilde V \end{bmatrix}
\begin{bmatrix} \Sigma^2 & 0 \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} V & \tilde V \end{bmatrix}^T,
\]

where $\tilde V$ is any matrix for which $[V \ \tilde V]$ is orthogonal. The righthand expression is the eigenvalue decomposition of $A^T A$, so we conclude that its nonzero eigenvalues are the singular values of $A$ squared, and the associated eigenvectors of $A^T A$ are the right singular vectors of $A$. A similar analysis of $A A^T$ shows that its nonzero eigenvalues are also the squares of the singular values of $A$, and the associated eigenvectors are the left singular vectors of $A$.

The first or largest singular value is also written as $\sigma_{\max}(A)$. It can be expressed as

\[
\sigma_{\max}(A) = \sup_{x, y \neq 0} \frac{x^T A y}{\|x\|_2 \|y\|_2} = \sup_{y \neq 0} \frac{\|Ay\|_2}{\|y\|_2}.
\]

The righthand expression shows that the maximum singular value is the $\ell_2$ operator norm of $A$. The minimum singular value of $A \in \mathbf{R}^{m \times n}$ is given by

\[
\sigma_{\min}(A) = \begin{cases} \sigma_r(A) & r = \min\{m, n\} \\ 0 & r < \min\{m, n\}, \end{cases}
\]

which is positive if and only if $A$ is full rank.

The singular values of a symmetric matrix are the absolute values of its nonzero eigenvalues, sorted into descending order. The singular values of a symmetric positive semidefinite matrix are the same as its nonzero eigenvalues.

The condition number of a nonsingular $A \in \mathbf{R}^{n \times n}$, denoted $\operatorname{cond}(A)$ or $\kappa(A)$, is defined as

\[
\operatorname{cond}(A) = \|A\|_2 \|A^{-1}\|_2 = \sigma_{\max}(A) / \sigma_{\min}(A).
\]


Pseudo-inverse 

Let $A = U \Sigma V^T$ be the singular value decomposition of $A \in \mathbf{R}^{m \times n}$, with $\operatorname{rank} A = r$. We define the pseudo-inverse or Moore-Penrose inverse of $A$ as

\[
A^\dagger = V \Sigma^{-1} U^T \in \mathbf{R}^{n \times m}.
\]

Alternative expressions are

\[
A^\dagger = \lim_{\epsilon \to 0} (A^T A + \epsilon I)^{-1} A^T = \lim_{\epsilon \to 0} A^T (A A^T + \epsilon I)^{-1},
\]

where the limits are taken with $\epsilon > 0$, which ensures that the inverses in the expressions exist. If $\operatorname{rank} A = n$, then $A^\dagger = (A^T A)^{-1} A^T$. If $\operatorname{rank} A = m$, then $A^\dagger = A^T (A A^T)^{-1}$. If $A$ is square and nonsingular, then $A^\dagger = A^{-1}$.

The pseudo-inverse comes up in problems involving least-squares, minimum norm, quadratic minimization, and (Euclidean) projection. For example, $A^\dagger b$ is a solution of the least-squares problem

\[
\text{minimize} \quad \| Ax - b \|_2^2
\]

in general. When the solution is not unique, $A^\dagger b$ gives the solution with minimum (Euclidean) norm. As another example, the matrix $A A^\dagger = U U^T$ gives (Euclidean) projection on $\mathcal{R}(A)$. The matrix $A^\dagger A = V V^T$ gives (Euclidean) projection on $\mathcal{R}(A^T)$.
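The properties of the pseudo-inverse listed above can be checked on a small rank-deficient matrix. The following sketch assumes NumPy; the matrix is built from hypothetical factors `U0`, `sigma`, `V0` so that its singular value decomposition is known in advance, and the cutoff `rcond=1e-10` passed to the library routines is a choice made here.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, r = 5, 4, 2
U0, _ = np.linalg.qr(rng.standard_normal((m, r)))   # m x r, orthonormal columns
V0, _ = np.linalg.qr(rng.standard_normal((n, r)))   # n x r, orthonormal columns
sigma = np.array([3.0, 1.0])                        # positive singular values
A = U0 @ np.diag(sigma) @ V0.T                      # rank-2 matrix with known SVD
b = rng.standard_normal(m)

# pseudo-inverse from the definition V Sigma^{-1} U^T
A_pinv = V0 @ np.diag(1 / sigma) @ U0.T

# minimum-norm least-squares solution, two ways
x_ls = A_pinv @ b
x_ref, *_ = np.linalg.lstsq(A, b, rcond=1e-10)

# regularized expression (A^T A + eps I)^{-1} A^T for a small eps > 0
eps = 1e-8
A_limit = np.linalg.solve(A.T @ A + eps * np.eye(n), A.T)
```

Here `A_limit` only approximates the limit, since `eps` is small but nonzero.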

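The optimal value $p^\star$ of the (general, nonconvex) quadratic optimization problem

\[
\text{minimize} \quad (1/2) x^T P x + q^T x + r,
\]

where $P \in \mathbf{S}^n$, can be expressed as

\[
p^\star = \begin{cases} -(1/2) q^T P^\dagger q + r & P \succeq 0, \ q \in \mathcal{R}(P) \\ -\infty & \text{otherwise.} \end{cases}
\]

(This generalizes the expression $p^\star = -(1/2) q^T P^{-1} q + r$, valid for $P \succ 0$.)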
A.5.5 Schur complement


Consider a matrix $X \in \mathbf{S}^n$ partitioned as

\[
X = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix},
\]

where $A \in \mathbf{S}^k$. If $\det A \neq 0$, the matrix

\[
S = C - B^T A^{-1} B
\]

is called the Schur complement of $A$ in $X$. Schur complements arise in several contexts, and appear in many important formulas and theorems. For example, we have

\[
\det X = \det A \det S.
\]


Inverse  of  block  matrix 

The Schur complement comes up in solving linear equations, by eliminating one block of variables. We start with

\[
\begin{bmatrix} A & B \\ B^T & C \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix} =
\begin{bmatrix} u \\ v \end{bmatrix},
\]

and assume that $\det A \neq 0$. If we eliminate $x$ from the top block equation and substitute it into the bottom block equation, we obtain $v = B^T A^{-1} u + S y$, so

\[
y = S^{-1} (v - B^T A^{-1} u).
\]

Substituting this into the first equation yields

\[
x = \left( A^{-1} + A^{-1} B S^{-1} B^T A^{-1} \right) u - A^{-1} B S^{-1} v.
\]

We can express these two equations as a formula for the inverse of a block matrix:

\[
\begin{bmatrix} A & B \\ B^T & C \end{bmatrix}^{-1} =
\begin{bmatrix} A^{-1} + A^{-1} B S^{-1} B^T A^{-1} & -A^{-1} B S^{-1} \\ -S^{-1} B^T A^{-1} & S^{-1} \end{bmatrix}.
\]

In particular, we see that the Schur complement is the inverse of the 2,2 block entry of the inverse of $X$.
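The block inverse formula, together with the determinant identity $\det X = \det A \det S$, can be verified numerically. This is a sketch assuming NumPy; the test matrix is chosen positive definite only so that $A$ and $S$ are guaranteed to be invertible.

```python
import numpy as np

rng = np.random.default_rng(8)
k, n = 2, 5
G = rng.standard_normal((n, n))
X = G @ G.T + np.eye(n)                 # symmetric positive definite test matrix
A, B, C = X[:k, :k], X[:k, k:], X[k:, k:]

A_inv = np.linalg.inv(A)
S = C - B.T @ A_inv @ B                 # Schur complement of A in X
S_inv = np.linalg.inv(S)

# assemble the inverse block by block
X_inv_blocks = np.block([
    [A_inv + A_inv @ B @ S_inv @ B.T @ A_inv, -A_inv @ B @ S_inv],
    [-S_inv @ B.T @ A_inv, S_inv],
])
```

In numerical practice one would usually solve the block equations with factorizations rather than explicit inverses; explicit inverses are used here only to mirror the formula.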


Minimization  and  definiteness 


The Schur complement arises when you minimize a quadratic form over some of the variables. Suppose $A \succ 0$, and consider the minimization problem

\[
\text{minimize} \quad u^T A u + 2 v^T B^T u + v^T C v \tag{A.13}
\]

with variable $u$. The solution is $u = -A^{-1} B v$, and the optimal value is

\[
\inf_u \begin{bmatrix} u \\ v \end{bmatrix}^T
\begin{bmatrix} A & B \\ B^T & C \end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix} = v^T S v. \tag{A.14}
\]

From this we can derive the following characterizations of positive definiteness or semidefiniteness of the block matrix $X$:

• $X \succ 0$ if and only if $A \succ 0$ and $S \succ 0$.

• If $A \succ 0$, then $X \succeq 0$ if and only if $S \succeq 0$.

Schur  complement  with  singular  A 

Some Schur complement results have generalizations to the case when $A$ is singular, although the details are more complicated. As an example, if $A \succeq 0$ and $Bv \in \mathcal{R}(A)$, then the quadratic minimization problem (A.13) (with variable $u$) is solvable, and has optimal value

\[
v^T (C - B^T A^\dagger B) v,
\]

where $A^\dagger$ is the pseudo-inverse of $A$. The problem is unbounded if $Bv \notin \mathcal{R}(A)$ or if $A \not\succeq 0$.

The range condition $Bv \in \mathcal{R}(A)$ can also be expressed as $(I - A A^\dagger) B v = 0$, so we have the following characterization of positive semidefiniteness of the block matrix $X$:

\[
X \succeq 0 \iff A \succeq 0, \quad (I - A A^\dagger) B = 0, \quad C - B^T A^\dagger B \succeq 0.
\]

Here the matrix $C - B^T A^\dagger B$ serves as a generalization of the Schur complement, when $A$ is singular.
Appendix B


Problems involving two quadratic functions


In this appendix we consider some optimization problems that involve two quadratic, but not necessarily convex, functions. Several strong results hold for these problems, even when they are not convex.


B.1 Single constraint quadratic optimization


We consider the problem with one constraint

\[
\begin{array}{ll}
\text{minimize} & x^T A_0 x + 2 b_0^T x + c_0 \\
\text{subject to} & x^T A_1 x + 2 b_1^T x + c_1 \leq 0,
\end{array}
\tag{B.1}
\]

with variable $x \in \mathbf{R}^n$, and problem parameters $A_i \in \mathbf{S}^n$, $b_i \in \mathbf{R}^n$, $c_i \in \mathbf{R}$. We do not assume that $A_i \succeq 0$, so problem (B.1) is not a convex optimization problem.

The Lagrangian of (B.1) is

\[
L(x, \lambda) = x^T (A_0 + \lambda A_1) x + 2 (b_0 + \lambda b_1)^T x + c_0 + \lambda c_1,
\]

and the dual function is

\[
g(\lambda) = \inf_x L(x, \lambda) =
\begin{cases}
c_0 + \lambda c_1 - (b_0 + \lambda b_1)^T (A_0 + \lambda A_1)^\dagger (b_0 + \lambda b_1) & A_0 + \lambda A_1 \succeq 0, \ b_0 + \lambda b_1 \in \mathcal{R}(A_0 + \lambda A_1) \\
-\infty & \text{otherwise}
\end{cases}
\]

(see §A.5.4). Using a Schur complement, we can express the dual problem as

\[
\begin{array}{ll}
\text{maximize} & \gamma \\
\text{subject to} & \lambda \geq 0 \\
& \begin{bmatrix} A_0 + \lambda A_1 & b_0 + \lambda b_1 \\ (b_0 + \lambda b_1)^T & c_0 + \lambda c_1 - \gamma \end{bmatrix} \succeq 0,
\end{array}
\tag{B.2}
\]

an SDP with two variables $\gamma, \lambda \in \mathbf{R}$.

The first result is that strong duality holds for problem (B.1) and its Lagrange dual (B.2), provided Slater's constraint qualification is satisfied, i.e., there exists an $x$ with $x^T A_1 x + 2 b_1^T x + c_1 < 0$. In other words, if (B.1) is strictly feasible, the optimal values of (B.1) and (B.2) are equal. (A proof is given in §B.4.)
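To make the strong duality claim concrete, the sketch below evaluates the dual function on a grid of $\lambda$ values and compares its maximum with a brute-force estimate of the primal optimal value. It assumes NumPy and a small hand-picked nonconvex instance in $\mathbf{R}^2$ (minimize $-x_1^2 + 2x_2^2 + x_1$ over the unit disk); the instance, the grids, and the tolerances are assumptions made for illustration, and a grid search is not how one would solve the SDP (B.2) in practice.

```python
import numpy as np

# hypothetical instance: nonconvex objective, unit-disk constraint
A0 = np.diag([-1.0, 2.0]); b0 = np.array([0.5, 0.0]); c0 = 0.0
A1 = np.eye(2);            b1 = np.zeros(2);          c1 = -1.0

def dual_function(lam, tol=1e-9):
    # g(lam) from the closed-form expression with the pseudo-inverse
    H = A0 + lam * A1
    h = b0 + lam * b1
    if np.linalg.eigvalsh(H).min() < -tol:
        return -np.inf                       # H is not positive semidefinite
    H_pinv = np.linalg.pinv(H, rcond=1e-9)
    if np.linalg.norm(h - H @ (H_pinv @ h)) > 1e-7:
        return -np.inf                       # h is not in the range of H
    return c0 + lam * c1 - h @ H_pinv @ h

lams = np.linspace(0.0, 5.0, 5001)
d_star = max(dual_function(lam) for lam in lams)

# brute-force primal value on a grid over the feasible set
t = np.linspace(-1.0, 1.0, 401)
X1, X2 = np.meshgrid(t, t)
feasible = X1**2 + X2**2 + c1 <= 1e-12
obj = (A0[0, 0] * X1**2 + 2 * A0[0, 1] * X1 * X2 + A0[1, 1] * X2**2
       + 2 * (b0[0] * X1 + b0[1] * X2) + c0)
p_star = obj[feasible].min()
```

For this instance both values can be worked out by hand: the primal optimum is attained at $x = (-1, 0)$ and the dual optimum at $\lambda = 3/2$, and both equal $-2$.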


Relaxation  interpretation 

The dual of the SDP (B.2) is

\[
\begin{array}{ll}
\text{minimize} & \operatorname{tr}(A_0 X) + 2 b_0^T x + c_0 \\
\text{subject to} & \operatorname{tr}(A_1 X) + 2 b_1^T x + c_1 \leq 0 \\
& \begin{bmatrix} X & x \\ x^T & 1 \end{bmatrix} \succeq 0,
\end{array}
\tag{B.3}
\]

an SDP with variables $X \in \mathbf{S}^n$, $x \in \mathbf{R}^n$. This dual SDP has an interesting interpretation in terms of the original problem (B.1).

We first note that (B.1) is equivalent to

\[
\begin{array}{ll}
\text{minimize} & \operatorname{tr}(A_0 X) + 2 b_0^T x + c_0 \\
\text{subject to} & \operatorname{tr}(A_1 X) + 2 b_1^T x + c_1 \leq 0 \\
& X = x x^T.
\end{array}
\tag{B.4}
\]

In this formulation we express the quadratic terms $x^T A_i x$ as $\operatorname{tr}(A_i x x^T)$, and then introduce a new variable $X = x x^T$. Problem (B.4) has a linear objective function, one linear inequality constraint, and a nonlinear equality constraint $X = x x^T$. The next step is to replace the equality constraint by an inequality $X \succeq x x^T$:

\[
\begin{array}{ll}
\text{minimize} & \operatorname{tr}(A_0 X) + 2 b_0^T x + c_0 \\
\text{subject to} & \operatorname{tr}(A_1 X) + 2 b_1^T x + c_1 \leq 0 \\
& X \succeq x x^T.
\end{array}
\tag{B.5}
\]

This problem is called a relaxation of (B.4), since we have replaced one of the constraints with a looser constraint. Finally we note that the inequality in (B.5) can be expressed as a linear matrix inequality by using a Schur complement, which gives (B.3).

A number of interesting facts follow immediately from this interpretation of (B.3) as a relaxation of (B.1). First, it is obvious that the optimal value of (B.3) is less than or equal to the optimal value of (B.1), since we minimize the same objective function over a larger set. Second, we can conclude that if $X = x x^T$ at the optimum of (B.3), then $x$ must be optimal in (B.1).

Combining the result above, that strong duality holds between (B.1) and (B.2) (if (B.1) is strictly feasible), with strong duality between the dual SDPs (B.2) and (B.3), we conclude that strong duality holds between the original, nonconvex quadratic problem (B.1), and the SDP relaxation (B.3), provided (B.1) is strictly feasible.
B.2 The S-procedure


The next result is a theorem of alternatives for a pair of (nonconvex) quadratic inequalities. Let $A_1, A_2 \in \mathbf{S}^n$, $b_1, b_2 \in \mathbf{R}^n$, $c_1, c_2 \in \mathbf{R}$, and suppose there exists an $\hat x$ with

\[
\hat x^T A_2 \hat x + 2 b_2^T \hat x + c_2 < 0.
\]

Then there exists an $x \in \mathbf{R}^n$ satisfying

\[
x^T A_1 x + 2 b_1^T x + c_1 < 0, \qquad x^T A_2 x + 2 b_2^T x + c_2 \leq 0, \tag{B.6}
\]

if and only if there exists no $\lambda$ such that

\[
\lambda \geq 0, \qquad
\begin{bmatrix} A_1 & b_1 \\ b_1^T & c_1 \end{bmatrix} + \lambda
\begin{bmatrix} A_2 & b_2 \\ b_2^T & c_2 \end{bmatrix} \succeq 0. \tag{B.7}
\]

In other words, (B.6) and (B.7) are strong alternatives.

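This result is readily shown to be equivalent to the result from §B.1, and a proof is given in §B.4. Here we point out that the two inequality systems are clearly weak alternatives, since (B.6) and (B.7) together lead to a contradiction:

\[
\begin{aligned}
0 &\leq \begin{bmatrix} x \\ 1 \end{bmatrix}^T
\left( \begin{bmatrix} A_1 & b_1 \\ b_1^T & c_1 \end{bmatrix} + \lambda
\begin{bmatrix} A_2 & b_2 \\ b_2^T & c_2 \end{bmatrix} \right)
\begin{bmatrix} x \\ 1 \end{bmatrix} \\
&= x^T A_1 x + 2 b_1^T x + c_1 + \lambda ( x^T A_2 x + 2 b_2^T x + c_2 ) \\
&< 0.
\end{aligned}
\]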


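This theorem of alternatives is sometimes called the S-procedure, and is usually stated in the following form: the implication

\[
x^T F_1 x + 2 g_1^T x + h_1 \leq 0 \implies x^T F_2 x + 2 g_2^T x + h_2 \leq 0,
\]

where $F_i \in \mathbf{S}^n$, $g_i \in \mathbf{R}^n$, $h_i \in \mathbf{R}$, holds if and only if there exists a $\lambda$ such that

\[
\lambda \geq 0, \qquad
\begin{bmatrix} F_2 & g_2 \\ g_2^T & h_2 \end{bmatrix} \preceq \lambda
\begin{bmatrix} F_1 & g_1 \\ g_1^T & h_1 \end{bmatrix},
\]

provided there exists a point $\hat x$ with $\hat x^T F_1 \hat x + 2 g_1^T \hat x + h_1 < 0$. (Note that sufficiency is clear.)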


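Example B.1 Ellipsoid containment. An ellipsoid $\mathcal{E} \subseteq \mathbf{R}^n$ with nonempty interior can be represented as the sublevel set of a quadratic function,

\[
\mathcal{E} = \{ x \mid x^T F x + 2 g^T x + h \leq 0 \},
\]

where $F \in \mathbf{S}^n_{++}$ and $h - g^T F^{-1} g < 0$. Suppose $\tilde{\mathcal{E}}$ is another ellipsoid with similar representation,

\[
\tilde{\mathcal{E}} = \{ x \mid x^T \tilde F x + 2 \tilde g^T x + \tilde h \leq 0 \},
\]

with $\tilde F \in \mathbf{S}^n_{++}$, $\tilde h - \tilde g^T \tilde F^{-1} \tilde g < 0$. By the S-procedure, we see that $\mathcal{E} \subseteq \tilde{\mathcal{E}}$ if and only if there is a $\lambda > 0$ such that

\[
\begin{bmatrix} \tilde F & \tilde g \\ \tilde g^T & \tilde h \end{bmatrix} \preceq \lambda
\begin{bmatrix} F & g \\ g^T & h \end{bmatrix}.
\]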
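The containment test can be tried on two disks in $\mathbf{R}^2$. The sketch below assumes NumPy; the helper names `quad_matrix` and `contains` are hypothetical, and the search over $\lambda$ is a crude grid scan rather than a semidefinite feasibility solver, so a negative answer only means that no grid point satisfied the matrix inequality.

```python
import numpy as np

def quad_matrix(F, g, h):
    # block matrix [[F, g], [g^T, h]] of the quadratic x^T F x + 2 g^T x + h
    return np.block([[F, g[:, None]], [g[None, :], np.array([[h]])]])

def contains(M_inner, M_outer, lams=np.linspace(0.01, 20.0, 2000)):
    # search for lam > 0 with M_outer <= lam * M_inner (matrix inequality)
    for lam in lams:
        if np.linalg.eigvalsh(lam * M_inner - M_outer).min() >= -1e-9:
            return True, lam
    return False, None

# E: unit disk centered at the origin
M = quad_matrix(np.eye(2), np.zeros(2), -1.0)
# E_tilde: disk of radius 2 centered at c, i.e. x^T x - 2 c^T x + c^T c - 4 <= 0
c = np.array([0.5, 0.0])
M_tilde = quad_matrix(np.eye(2), -c, c @ c - 4.0)

found, lam_found = contains(M, M_tilde)     # is E contained in E_tilde?
found_reverse, _ = contains(M_tilde, M)     # is E_tilde contained in E?
```

The unit disk lies inside the larger disk, so a certificate $\lambda$ exists in the first direction, while the reverse containment fails.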




B.3 The field of values of two symmetric matrices

The following result is the basis for the proof of the strong duality result in §B.1 and the S-procedure in §B.2. If $A, B \in \mathbf{S}^n$, then for all $X \in \mathbf{S}^n_+$, there exists an $x \in \mathbf{R}^n$ such that

\[
x^T A x = \operatorname{tr}(AX), \qquad x^T B x = \operatorname{tr}(BX). \tag{B.8}
\]


Remark B.1 Geometric interpretation. This result has an interesting interpretation in terms of the set

\[
W(A, B) = \{ (x^T A x, x^T B x) \mid x \in \mathbf{R}^n \},
\]

which is a cone in $\mathbf{R}^2$. It is the cone generated by the set

\[
F(A, B) = \{ (x^T A x, x^T B x) \mid \|x\|_2 = 1 \},
\]

which is called the 2-dimensional field of values of the pair $(A, B)$. Geometrically, $W(A, B)$ is the image of the set of rank-one positive semidefinite matrices under the linear transformation $f : \mathbf{S}^n \to \mathbf{R}^2$ defined by

\[
f(X) = ( \operatorname{tr}(AX), \operatorname{tr}(BX) ).
\]

The result that for every $X \in \mathbf{S}^n_+$ there exists an $x$ satisfying (B.8) means that

\[
W(A, B) = f(\mathbf{S}^n_+).
\]

In other words, $W(A, B)$ is a convex cone.


The proof is constructive and uses induction on the rank of $X$. Suppose it is true for all $X \in \mathbf{S}^n_+$ with $1 \leq \operatorname{rank} X \leq k$, where $k \geq 2$, that there exists an $x$ such that (B.8) holds. Then the result also holds if $\operatorname{rank} X = k + 1$, as can be seen as follows. A matrix $X \in \mathbf{S}^n_+$ with $\operatorname{rank} X = k + 1$ can be expressed as $X = y y^T + Z$ where $y \neq 0$ and $Z \in \mathbf{S}^n_+$ with $\operatorname{rank} Z = k$. By assumption, there exists a $z$ such that $\operatorname{tr}(AZ) = z^T A z$, $\operatorname{tr}(BZ) = z^T B z$. Therefore

\[
\operatorname{tr}(AX) = \operatorname{tr}( A (y y^T + z z^T) ), \qquad \operatorname{tr}(BX) = \operatorname{tr}( B (y y^T + z z^T) ).
\]

The rank of $y y^T + z z^T$ is one or two, so by assumption there exists an $x$ such that (B.8) holds.

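It is therefore sufficient to prove the result if $\operatorname{rank} X \leq 2$. If $\operatorname{rank} X = 0$ or $\operatorname{rank} X = 1$ there is nothing to prove. If $\operatorname{rank} X = 2$, we can factor $X$ as $X = V V^T$ where $V \in \mathbf{R}^{n \times 2}$, with linearly independent columns $v_1$ and $v_2$. Without loss of generality we can assume that $V^T A V$ is diagonal. (If $V^T A V$ is not diagonal we replace $V$ with $VP$ where $V^T A V = P \operatorname{diag}(\lambda) P^T$ is the eigenvalue decomposition of $V^T A V$.) We will write $V^T A V$ and $V^T B V$ as

\[
V^T A V = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}, \qquad
V^T B V = \begin{bmatrix} \sigma_1 & \gamma \\ \gamma & \sigma_2 \end{bmatrix},
\]

and define

\[
w = \begin{bmatrix} \operatorname{tr}(AX) \\ \operatorname{tr}(BX) \end{bmatrix} =
\begin{bmatrix} \lambda_1 + \lambda_2 \\ \sigma_1 + \sigma_2 \end{bmatrix}.
\]

We need to show that $w = (x^T A x, x^T B x)$ for some $x$.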

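We distinguish two cases. First, assume $(0, \gamma)$ is a linear combination of the vectors $(\lambda_1, \sigma_1)$ and $(\lambda_2, \sigma_2)$:

\[
0 = z_1 \lambda_1 + z_2 \lambda_2, \qquad \gamma = z_1 \sigma_1 + z_2 \sigma_2,
\]

for some $z_1$, $z_2$. In this case we choose $x = \alpha v_1 + \beta v_2$, where $\alpha$ and $\beta$ are determined by solving two quadratic equations in two variables

\[
\alpha^2 + 2 \alpha \beta z_1 = 1, \qquad \beta^2 + 2 \alpha \beta z_2 = 1. \tag{B.9}
\]

This will give the desired result, since

\[
\begin{aligned}
\begin{bmatrix} (\alpha v_1 + \beta v_2)^T A (\alpha v_1 + \beta v_2) \\ (\alpha v_1 + \beta v_2)^T B (\alpha v_1 + \beta v_2) \end{bmatrix}
&= \alpha^2 \begin{bmatrix} \lambda_1 \\ \sigma_1 \end{bmatrix}
+ 2 \alpha \beta \begin{bmatrix} 0 \\ \gamma \end{bmatrix}
+ \beta^2 \begin{bmatrix} \lambda_2 \\ \sigma_2 \end{bmatrix} \\
&= (\alpha^2 + 2 \alpha \beta z_1) \begin{bmatrix} \lambda_1 \\ \sigma_1 \end{bmatrix}
+ (\beta^2 + 2 \alpha \beta z_2) \begin{bmatrix} \lambda_2 \\ \sigma_2 \end{bmatrix} \\
&= \begin{bmatrix} \lambda_1 + \lambda_2 \\ \sigma_1 + \sigma_2 \end{bmatrix}.
\end{aligned}
\]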


It remains to show that the equations (B.9) are solvable. To see this, we first note
that $\alpha$ and $\beta$ must be nonzero, so we can write the equations equivalently as

$$
\alpha^2 \bigl(1 + 2(\beta/\alpha) z_1\bigr) = 1, \qquad
(\beta/\alpha)^2 + 2(\beta/\alpha)(z_2 - z_1) = 1.
$$

The equation $t^2 + 2t(z_2 - z_1) = 1$ has a positive and a negative root. At least one
of these roots (the root with the same sign as $z_1$) satisfies $1 + 2tz_1 > 0$, so we can
choose

$$
\alpha = \pm 1/\sqrt{1 + 2tz_1}, \qquad \beta = t\alpha.
$$

This yields two solutions $(\alpha, \beta)$ that satisfy (B.9). (If both roots of
$t^2 + 2t(z_2 - z_1) = 1$ satisfy $1 + 2tz_1 > 0$, we obtain four solutions.)

Next, assume that $(0, \gamma)$ is not a linear combination of $(\lambda_1, \sigma_1)$ and
$(\lambda_2, \sigma_2)$. In particular, this means that $(\lambda_1, \sigma_1)$ and $(\lambda_2, \sigma_2)$
are linearly dependent. Therefore their sum $w = (\lambda_1 + \lambda_2, \sigma_1 + \sigma_2)$ is a
nonnegative multiple of $(\lambda_1, \sigma_1)$, or $(\lambda_2, \sigma_2)$, or both. If
$w = \alpha^2 (\lambda_1, \sigma_1)$ for some $\alpha$, we can choose $x = \alpha v_1$. If
$w = \beta^2 (\lambda_2, \sigma_2)$ for some $\beta$, we can choose $x = \beta v_2$.
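
The first case of the construction can be checked numerically. The following is a minimal sketch, assuming the reduced data $\lambda_1, \lambda_2, \sigma_1, \sigma_2, \gamma$ are already available and that $(\lambda_1, \sigma_1)$ and $(\lambda_2, \sigma_2)$ are linearly independent; the function name `rank_two_coefficients` is hypothetical and not part of the text.

```python
import math

def rank_two_coefficients(lam1, lam2, sig1, sig2, gamma):
    """Return (alpha, beta) satisfying the two quadratic equations (B.9).

    Assumes (lam1, sig1) and (lam2, sig2) are linearly independent, so that
    (0, gamma) is a linear combination of them (first case of the proof).
    """
    det = lam1 * sig2 - lam2 * sig1
    if det == 0:
        raise ValueError("vectors are linearly dependent: use the second case")
    # Solve z1*(lam1, sig1) + z2*(lam2, sig2) = (0, gamma) by Cramer's rule.
    z1 = -lam2 * gamma / det
    z2 = lam1 * gamma / det
    # Roots of t^2 + 2 t (z2 - z1) = 1: one is positive, one is negative.
    d = z2 - z1
    roots = (-d + math.sqrt(d * d + 1.0), -d - math.sqrt(d * d + 1.0))
    # Pick a root with 1 + 2 t z1 > 0 (the root with the sign of z1 always works).
    t = next(r for r in roots if 1.0 + 2.0 * r * z1 > 0.0)
    alpha = 1.0 / math.sqrt(1.0 + 2.0 * t * z1)
    return alpha, t * alpha
```

With $x = \alpha v_1 + \beta v_2$, the returned pair should reproduce $w$, i.e., $\alpha^2 \lambda_1 + \beta^2 \lambda_2 = \lambda_1 + \lambda_2$ and $\alpha^2 \sigma_1 + 2\alpha\beta\gamma + \beta^2 \sigma_2 = \sigma_1 + \sigma_2$, up to rounding.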


B.4  Proofs  of  the  strong  duality  results 

We first prove the S-procedure result given in §B.2. The assumption of strict
feasibility of $\hat x$ implies that the matrix

$$
\begin{bmatrix} A_2 & b_2 \\ b_2^T & c_2 \end{bmatrix}
$$

has at least one negative eigenvalue. Therefore

$$
\tau \ge 0, \quad \tau \begin{bmatrix} A_2 & b_2 \\ b_2^T & c_2 \end{bmatrix} \succeq 0
\quad \Longrightarrow \quad \tau = 0.
$$


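We can apply the theorem of alternatives for nonstrict linear matrix inequalities,
given in example 5.14, which states that (B.7) is infeasible if and only if

$$
X \succeq 0, \qquad
\operatorname{tr}\left( X \begin{bmatrix} A_1 & b_1 \\ b_1^T & c_1 \end{bmatrix} \right) < 0, \qquad
\operatorname{tr}\left( X \begin{bmatrix} A_2 & b_2 \\ b_2^T & c_2 \end{bmatrix} \right) \le 0
$$

is feasible. From §B.3 this is equivalent to feasibility of

$$
\begin{bmatrix} v \\ w \end{bmatrix}^T
\begin{bmatrix} A_1 & b_1 \\ b_1^T & c_1 \end{bmatrix}
\begin{bmatrix} v \\ w \end{bmatrix} < 0, \qquad
\begin{bmatrix} v \\ w \end{bmatrix}^T
\begin{bmatrix} A_2 & b_2 \\ b_2^T & c_2 \end{bmatrix}
\begin{bmatrix} v \\ w \end{bmatrix} \le 0.
$$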

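If $w \neq 0$, then $x = v/w$ is feasible in (B.6). If $w = 0$, we have $v^T A_1 v < 0$,
$v^T A_2 v \le 0$, so $x = \hat x + tv$ satisfies

$$
\begin{aligned}
x^T A_1 x + 2 b_1^T x + c_1
&= \hat x^T A_1 \hat x + 2 b_1^T \hat x + c_1 + t^2 v^T A_1 v + 2t (A_1 \hat x + b_1)^T v, \\
x^T A_2 x + 2 b_2^T x + c_2
&= \hat x^T A_2 \hat x + 2 b_2^T \hat x + c_2 + t^2 v^T A_2 v + 2t (A_2 \hat x + b_2)^T v \\
&< 2t (A_2 \hat x + b_2)^T v,
\end{aligned}
$$

i.e., $x$ becomes feasible as $t \to \pm\infty$, depending on the sign of $(A_2 \hat x + b_2)^T v$.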

Finally, we prove the result in §B.1, i.e., that the optimal values of (B.1)
and (B.2) are equal if (B.1) is strictly feasible. To do this we note that $\gamma$ is a
lower bound for the optimal value of (B.1) if

$$
x^T A_1 x + 2 b_1^T x + c_1 \le 0 \quad \Longrightarrow \quad x^T A_0 x + 2 b_0^T x + c_0 \ge \gamma.
$$

By the S-procedure this is true if and only if there exists a $\lambda \ge 0$ such that

$$
\begin{bmatrix} A_0 & b_0 \\ b_0^T & c_0 - \gamma \end{bmatrix}
+ \lambda \begin{bmatrix} A_1 & b_1 \\ b_1^T & c_1 \end{bmatrix} \succeq 0,
$$

i.e., $\gamma$, $\lambda$ are feasible in (B.2).


Appendix C

Numerical  linear  algebra 
background 


In  this  appendix  we  give  a  brief  overview  of  some  basic  numerical  linear  algebra, 
concentrating  on  methods  for  solving  one  or  more  sets  of  linear  equations.  We  focus 
on direct (i.e., noniterative) methods, and how problem structure can be exploited
to  improve  efficiency.  There  are  many  important  issues  and  methods  in  numerical 
linear  algebra  that  we  do  not  consider  here,  including  numerical  stability,  details 
of  matrix  factorizations,  methods  for  parallel  or  multiple  processors,  and  iterative 
methods.  For  these  (and  other)  topics,  we  refer  the  reader  to  the  references  given 
at  the  end  of  this  appendix. 


C.1 Matrix structure and algorithm complexity

We concentrate on methods for solving the set of linear equations

$$
Ax = b \tag{C.1}
$$

where $A \in \mathbf{R}^{n \times n}$ and $b \in \mathbf{R}^n$. We assume $A$ is nonsingular, so the
solution is unique for all values of $b$, and given by $x = A^{-1}b$. This basic problem
arises in many optimization algorithms, and often accounts for most of the computation.
In the context of solving the linear equations (C.1), the matrix $A$ is often called the
coefficient matrix, and the vector $b$ is called the righthand side.

The standard generic methods for solving (C.1) require a computational effort
that grows approximately like $n^3$. These methods assume nothing more about $A$
than nonsingularity, and so are generally applicable. For $n$ several hundred or
smaller, these generic methods are probably the best methods to use, except in the
most demanding real-time applications. For $n$ more than a thousand or so, the
generic methods of solving $Ax = b$ become less practical.

Coefficient matrix structure

In many cases the coefficient matrix $A$ has some special structure or form that can
be exploited to solve the equation $Ax = b$ more efficiently, using methods tailored
for the special structure. For example, in the Newton system
$\nabla^2 f(x) \Delta x_{\mathrm{nt}} = -\nabla f(x)$, the coefficient matrix is symmetric and
positive definite, which allows us to use a solution method that is around twice as fast
as the generic method (and also has better roundoff properties). There are many other
types of structure that can be exploited, with computational savings (or algorithm
speedup) that is usually far more than a factor of two. In many cases, the effort is
reduced to something proportional to $n^2$ or even $n$, as compared to $n^3$ for the
generic methods. Since these methods are usually applied when $n$ is at least a hundred,
and often far larger, the savings can be dramatic.

A wide variety of coefficient matrix structures can be exploited. Simple examples
related to the sparsity pattern (i.e., the pattern of zero and nonzero entries
in  the  matrix)  include  banded,  block  diagonal,  or  sparse  matrices.  A  more  subtle 
exploitable  structure  is  diagonal  plus  low  rank.  Many  common  forms  of  convex 
optimization  problems  lead  to  linear  equations  with  coefficient  matrices  that  have 
these  exploitable  structures.  (There  are  many  other  matrix  structures  that  can  be 
exploited,  e.g.,  Toeplitz,  Hankel,  and  circulant,  that  we  will  not  consider  in  this 
appendix.) 

We  refer  to  a  generic  method  that  does  not  exploit  any  sparsity  pattern  in  the 
matrices  as  one  for  dense  matrices.  We  refer  to  a  method  that  does  not  exploit  any 
structure  at  all  in  the  matrices  as  one  for  unstructured  matrices. 


C.1.1  Complexity  analysis  via  flop  count 

The  cost  of  a  numerical  linear  algebra  algorithm  is  often  expressed  by  giving  the 
total  number  of  floating-point  operations  or  flops  required  to  carry  it  out,  as  a 
function of various problem dimensions. We define a flop as one addition, subtraction,
multiplication, or division of two floating-point numbers. (Some authors
define a flop as one multiplication followed by one addition, so their flop counts
are smaller by a factor up to two.) To evaluate the complexity of an algorithm, we
count the total number of flops, express it as a function (usually a polynomial) of
the dimensions of the matrices and vectors involved, and simplify the expression
by ignoring all terms except the leading (i.e., highest order or dominant) terms.
As an example, suppose that a particular algorithm requires a total of

$$
m^3 + 3m^2 n + mn + 4mn^2 + 5m + 22
$$

flops, where $m$ and $n$ are problem dimensions. We would normally simplify this
flop count to

$$
m^3 + 3m^2 n + 4mn^2
$$

flops, since these are the leading terms in the problem dimensions $m$ and $n$. If
in addition we assumed that $m \ll n$, we would further simplify the flop count to
$4mn^2$.

Flop counts were originally popularized when floating-point operations were relatively
slow, so counting the number gave a good estimate of the total computation
time. This is no longer the case: issues such as cache boundaries and locality of
reference can dramatically affect the computation time of a numerical algorithm.
However, flop counts can still give us a good rough estimate of the computation
time of a numerical algorithm, and how the time grows with increasing problem
size. Since a flop count no longer accurately predicts the computation time of an
algorithm, we usually pay most attention to its order or orders, i.e., its largest
exponents, and ignore differences in flop counts smaller than a factor of two or so.
For example, an algorithm with flop count $5n^2$ is considered comparable to one
with a flop count $4n^2$, but faster than an algorithm with flop count $(1/3)n^3$.


C.1.2 Cost of basic matrix-vector operations

Vector operations

To compute the inner product $x^T y$ of two vectors $x, y \in \mathbf{R}^n$ we form the products
$x_i y_i$, and then add them, which requires $n$ multiplies and $n-1$ additions, or $2n-1$
flops. As mentioned above, we keep only the leading term, and say that the inner
product requires $2n$ flops, or even more approximately, order $n$ flops. A scalar-vector
multiplication $\alpha x$, where $\alpha \in \mathbf{R}$ and $x \in \mathbf{R}^n$, costs $n$ flops. The addition
$x + y$ of two vectors $x, y \in \mathbf{R}^n$ also costs $n$ flops.

If the vectors $x$ and $y$ are sparse, i.e., have only a few nonzero terms, these
basic operations can be carried out faster (assuming the vectors are stored using
an appropriate data structure). For example, if $x$ is a sparse vector with $N$ nonzero
entries, then the inner product $x^T y$ can be computed in $2N$ flops.
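
As an illustration of the sparse case, here is a minimal sketch that assumes the sparse vector is stored as a dictionary mapping index to value. This storage choice and the name `sparse_dot` are assumptions made for the example, not something prescribed by the text.

```python
def sparse_dot(x_sparse, y):
    """Inner product of a sparse x, stored as {index: value}, with a dense list y.

    Only the N stored nonzeros of x are visited, so the work is about 2N flops
    regardless of the length of y.
    """
    return sum(value * y[index] for index, value in x_sparse.items())


x = {1: 2.0, 4: -1.0}                   # a vector in R^6 with N = 2 nonzeros
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
result = sparse_dot(x, y)               # 2.0 * 2.0 + (-1.0) * 5.0
```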

Matrix-vector  multiplication 

A matrix-vector multiplication $y = Ax$, where $A \in \mathbf{R}^{m \times n}$, costs $2mn$ flops: we have
to calculate $m$ components of $y$, each of which is the product of a row of $A$ with
$x$, i.e., an inner product of two vectors in $\mathbf{R}^n$.

Matrix-vector products can often be accelerated by taking advantage of structure
in $A$. For example, if $A$ is diagonal, then $Ax$ can be computed in $n$ flops,
instead of $2n^2$ flops for multiplication by a general $n \times n$ matrix. More generally, if
$A$ is sparse, with only $N$ nonzero elements (out of $mn$), then $2N$ flops are needed
to form $Ax$, since we can skip multiplications and additions with zero.

As a less obvious example, suppose the matrix $A$ has rank $p \ll \min\{m, n\}$, and
is represented (stored) in the factored form $A = UV$, where $U \in \mathbf{R}^{m \times p}$, $V \in \mathbf{R}^{p \times n}$.
Then we can compute $Ax$ by first computing $Vx$ (which costs $2pn$ flops), and then
computing $U(Vx)$ (which costs $2mp$ flops), so the total is $2p(m + n)$ flops. Since
$p \ll \min\{m, n\}$, this is small compared to $2mn$.
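
The factored product can be sketched as follows, assuming matrices are stored as lists of rows. The helper names (`matvec`, `factored_matvec`, `flops_dense`, `flops_factored`) are illustrative only.

```python
def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def factored_matvec(U, V, x):
    """Compute (U V) x as U (V x), without ever forming the m-by-n product U V."""
    return matvec(U, matvec(V, x))


def flops_dense(m, n):
    """Leading-term flop count for a dense m-by-n matrix-vector product."""
    return 2 * m * n


def flops_factored(m, n, p):
    """Leading-term flop count for U (V x), U m-by-p and V p-by-n."""
    return 2 * p * (m + n)
```

For instance, with $m = n = 1000$ and $p = 5$ the two counts are $2 \cdot 10^6$ and $2 \cdot 10^4$ flops, a factor of 100 apart.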

Matrix-matrix  multiplication 

The matrix-matrix product $C = AB$, where $A \in \mathbf{R}^{m \times n}$ and $B \in \mathbf{R}^{n \times p}$, costs $2mnp$
flops. We have $mp$ elements in $C$ to calculate, each of which is an inner product of
two vectors of length $n$. Again, we can often make substantial savings by taking
advantage of structure in $A$ and $B$. For example, if $A$ and $B$ are sparse, we can
accelerate the multiplication by skipping additions and multiplications with zero.
If $m = p$ and we know that $C$ is symmetric, then we can calculate the matrix
product in $m^2 n$ flops, since we only have to compute the $(1/2)m(m + 1)$ elements
in the lower triangular part.

To form the product of several matrices, we can carry out the matrix-matrix
multiplications in different ways, which have different flop counts in general. The
simplest example is computing the product $D = ABC$, where $A \in \mathbf{R}^{m \times n}$,
$B \in \mathbf{R}^{n \times p}$, and $C \in \mathbf{R}^{p \times q}$. Here we can compute $D$ in two ways, using
matrix-matrix multiplies. One method is to first form the product $AB$ ($2mnp$ flops),
and then form $D = (AB)C$ ($2mpq$ flops), so the total is $2mp(n + q)$ flops.
Alternatively, we can first form the product $BC$ ($2npq$ flops), and then form
$D = A(BC)$ ($2mnq$ flops), with a total of $2nq(m + p)$ flops. The first method is
better when $2mp(n + q) < 2nq(m + p)$, i.e., when

$$
\frac{1}{n} + \frac{1}{q} < \frac{1}{m} + \frac{1}{p}.
$$

This assumes that no structure of the matrices is exploited in carrying out matrix-matrix
products.

For  products  of  more  than  three  matrices,  there  are  many  ways  to  parse  the 
product  into  matrix-matrix  multiplications.  Although  it  is  not  hard  to  develop  an 
algorithm  that  determines  the  best  parsing  (i.e.,  the  one  with  the  fewest  required 
flops)  given  the  matrix  dimensions,  in  most  applications  the  best  parsing  is  clear. 
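
One way to determine the best parsing is the classical dynamic program over sub-chains. The sketch below is one such approach, not necessarily the algorithm the authors have in mind; it assumes dense products costing $2 d_i d_k d_j$ flops, and the name `min_chain_flops` is hypothetical.

```python
from functools import lru_cache


def min_chain_flops(dims):
    """Fewest flops needed to multiply a chain of dense matrices.

    dims has one more entry than there are matrices: matrix i has
    dims[i] rows and dims[i + 1] columns.
    """
    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest way to multiply matrices i..j (inclusive) into one matrix.
        if i == j:
            return 0
        return min(
            best(i, k) + best(k + 1, j) + 2 * dims[i] * dims[k + 1] * dims[j + 1]
            for k in range(i, j)
        )

    return best(0, len(dims) - 2)


# D = ABC with m = 1000, n = 10, p = 1000, q = 10.
m, n, p, q = 1000, 10, 1000, 10
flops_ab_first = 2 * m * p * (n + q)    # (AB)C
flops_bc_first = 2 * n * q * (m + p)    # A(BC)
cheapest = min_chain_flops((m, n, p, q))
```

For three matrices the dynamic program simply returns the smaller of the two totals given above; here $1/n + 1/q > 1/m + 1/p$, so forming $BC$ first is cheaper.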


C.2  Solving  linear  equations  with  factored  matrices 

C.2.1  Linear  equations  that  are  easy  to  solve 

We start by examining some cases for which $Ax = b$ is easily solved, i.e., $x = A^{-1}b$
is easily computed.

Diagonal matrices

Suppose $A$ is diagonal and nonsingular (i.e., $a_{ii} \neq 0$ for all $i$). The set of linear
equations $Ax = b$ can be written as $a_{ii} x_i = b_i$, $i = 1, \ldots, n$. The solution is given
by $x_i = b_i / a_{ii}$, and can be calculated in $n$ flops.

Lower  triangular  matrices 

A matrix $A \in \mathbf{R}^{n \times n}$ is lower triangular if $a_{ij} = 0$ for $j > i$. A lower triangular
matrix is called unit lower triangular if the diagonal elements are equal to one. A
lower triangular matrix is nonsingular if and only if $a_{ii} \neq 0$ for all $i$.

Suppose $A$ is lower triangular and nonsingular. The equations $Ax = b$ are

$$
\begin{bmatrix}
a_{11} & 0 & \cdots & 0 \\
a_{21} & a_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}.
$$

From the first row, we have $a_{11} x_1 = b_1$, from which we conclude $x_1 = b_1 / a_{11}$.
From the second row we have $a_{21} x_1 + a_{22} x_2 = b_2$, so we can express $x_2$ as
$x_2 = (b_2 - a_{21} x_1) / a_{22}$. (We have already computed $x_1$, so every number on the
righthand side is known.) Continuing this way, we can express each component of $x$ in
terms of previous components, yielding the algorithm

$$
\begin{aligned}
x_1 &:= b_1 / a_{11} \\
x_2 &:= (b_2 - a_{21} x_1) / a_{22} \\
x_3 &:= (b_3 - a_{31} x_1 - a_{32} x_2) / a_{33} \\
&\;\;\vdots \\
x_n &:= (b_n - a_{n1} x_1 - a_{n2} x_2 - \cdots - a_{n,n-1} x_{n-1}) / a_{nn}.
\end{aligned}
$$

This procedure is called forward substitution, since we successively compute the
components of $x$ by substituting the known values into the next equation.

Let us give a flop count for forward substitution. We start by calculating $x_1$ (1
flop). We substitute $x_1$ in the second equation to find $x_2$ (3 flops), then substitute
$x_1$ and $x_2$ in the third equation to find $x_3$ (5 flops), etc. The total number of flops
is

$$
1 + 3 + 5 + \cdots + (2n - 1) = n^2.
$$

Thus, when $A$ is lower triangular and nonsingular, we can compute $x = A^{-1}b$ in
$n^2$ flops.

If the matrix $A$ has additional structure, in addition to being lower triangular,
then forward substitution can be more efficient than $n^2$ flops. For example, if $A$
is sparse (or banded), with at most $k$ nonzero entries per row, then each forward
substitution step requires at most $2k + 1$ flops, so the overall flop count is $2(k + 1)n$,
or $2kn$ after dropping the term $2n$.
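
Forward substitution for a dense lower triangular matrix can be sketched as below, assuming the matrix is stored as a list of rows. The function name and the explicit flop counter are illustrative additions, included only to confirm the $n^2$ count.

```python
def forward_substitution(L, b):
    """Solve L x = b for lower triangular, nonsingular L (list of rows).

    Returns the solution and the number of flops used.
    """
    n = len(b)
    x = [0.0] * n
    flops = 0
    for i in range(n):
        s = b[i]
        for j in range(i):
            s -= L[i][j] * x[j]     # one multiplication and one subtraction
            flops += 2
        x[i] = s / L[i][i]          # one division
        flops += 1
    return x, flops
```

Row $i$ (counting from one) uses $2(i - 1) + 1 = 2i - 1$ flops, which reproduces the sum $1 + 3 + \cdots + (2n - 1) = n^2$.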

Upper  triangular  matrices 

A matrix $A \in \mathbf{R}^{n \times n}$ is upper triangular if $A^T$ is lower triangular, i.e., if
$a_{ij} = 0$ for $j < i$. We can solve linear equations with nonsingular upper triangular
coefficient matrix in a way similar to forward substitution, except that we start by
calculating $x_n$, then $x_{n-1}$, and so on. The algorithm is

$$
\begin{aligned}
x_n &:= b_n / a_{nn} \\
x_{n-1} &:= (b_{n-1} - a_{n-1,n} x_n) / a_{n-1,n-1} \\
x_{n-2} &:= (b_{n-2} - a_{n-2,n-1} x_{n-1} - a_{n-2,n} x_n) / a_{n-2,n-2} \\
&\;\;\vdots \\
x_1 &:= (b_1 - a_{12} x_2 - a_{13} x_3 - \cdots - a_{1n} x_n) / a_{11}.
\end{aligned}
$$

This is called backward substitution or back substitution since we determine the
coefficients in backward order. The cost to compute $x = A^{-1}b$ via backward
substitution is $n^2$ flops. If $A$ is upper triangular and sparse (or banded), with at
most $k$ nonzero entries per row, then back substitution costs $2kn$ flops.
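
A corresponding sketch of back substitution, under the same assumptions (dense storage as a list of rows, illustrative function name):

```python
def back_substitution(U, b):
    """Solve U x = b for upper triangular, nonsingular U (list of rows)."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = b[i]
        for j in range(i + 1, n):
            s -= U[i][j] * x[j]     # x[j] for j > i is already known
        x[i] = s / U[i][i]
    return x
```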

Orthogonal  matrices 

A matrix $A \in \mathbf{R}^{n \times n}$ is orthogonal if $A^T A = I$, i.e., $A^{-1} = A^T$. In this case we can
compute $x = A^{-1}b$ by a simple matrix-vector product $x = A^T b$, which costs $2n^2$ flops
in general.

If the matrix $A$ has additional structure, we can compute $x = A^{-1}b$ even more
efficiently than $2n^2$ flops. For example, if $A$ has the form $A = I - 2uu^T$, where
$\|u\|_2 = 1$, we can compute

$$
x = A^{-1}b = (I - 2uu^T)^T b = b - 2(u^T b)u
$$

by first computing $u^T b$, then forming $b - 2(u^T b)u$, which costs $4n$ flops.
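
The $4n$-flop computation can be sketched as follows. The name `reflector_solve` is hypothetical, and the code assumes the caller supplies a unit-norm `u`.

```python
def reflector_solve(u, b):
    """Solve (I - 2 u u^T) x = b, assuming ||u||_2 = 1, without forming the matrix."""
    c = 2.0 * sum(ui * bi for ui, bi in zip(u, b))   # 2 (u^T b): about 2n flops
    return [bi - c * ui for ui, bi in zip(u, b)]     # b - 2 (u^T b) u: 2n flops


u = [0.6, 0.8]                   # unit norm: 0.36 + 0.64 = 1
x = reflector_solve(u, [1.0, 0.0])
```

Applying the same function to `x` should return the original righthand side, since the matrix $I - 2uu^T$ is its own inverse.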

Permutation  matrices 

Let $\pi = (\pi_1, \ldots, \pi_n)$ be a permutation of $(1, 2, \ldots, n)$. The associated permutation
matrix $A \in \mathbf{R}^{n \times n}$ is given by

$$
A_{ij} = \begin{cases} 1 & j = \pi_i \\ 0 & \text{otherwise.} \end{cases}
$$

In each row (or column) of a permutation matrix there is exactly one entry with
value one; all other entries are zero. Multiplying a vector by a permutation matrix
simply permutes its coefficients:

$$
Ax = (x_{\pi_1}, \ldots, x_{\pi_n}).
$$

The inverse of a permutation matrix is the permutation matrix associated with the
inverse permutation $\pi^{-1}$. This turns out to be $A^T$, which shows that permutation
matrices are orthogonal.

If $A$ is a permutation matrix, solving $Ax = b$ is very easy: $x$ is obtained by
permuting the entries of $b$ by $\pi^{-1}$. This requires no floating point operations,
according to our definition (but, depending on the implementation, might involve
copying floating point numbers). We can reach the same conclusion from the
equation $x = A^T b$. The matrix $A^T$ (like $A$) has only one nonzero entry per row, with
value one. Thus no additions are required, and the only multiplications required
are by one.
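
A small sketch of both operations, assuming the permutation is stored as a zero-based list `pi` (so `pi[i]` plays the role of $\pi_{i+1} - 1$); the function names are illustrative.

```python
def permutation_apply(pi, x):
    """Return A x for the permutation matrix A with (A x)_i = x_{pi[i]}."""
    return [x[j] for j in pi]


def permutation_solve(pi, b):
    """Solve A x = b by applying the inverse permutation; no flops are needed."""
    x = [None] * len(b)
    for i, j in enumerate(pi):
        x[j] = b[i]                 # (A x)_i = x_j = b_i
    return x


pi = [2, 0, 1]
x = permutation_solve(pi, [10.0, 20.0, 30.0])
```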


C.2.2  The  factor-solve  method 

The basic approach to solving $Ax = b$ is based on expressing $A$ as a product of
nonsingular matrices,

$$
A = A_1 A_2 \cdots A_k,
$$

so that

$$
x = A^{-1} b = A_k^{-1} A_{k-1}^{-1} \cdots A_1^{-1} b.
$$

We can compute $x$ using this formula, working from right to left:

$$
\begin{aligned}
z_1 &:= A_1^{-1} b \\
z_2 &:= A_2^{-1} z_1 = A_2^{-1} A_1^{-1} b \\
&\;\;\vdots \\
z_{k-1} &:= A_{k-1}^{-1} z_{k-2} = A_{k-1}^{-1} \cdots A_1^{-1} b \\
x &:= A_k^{-1} z_{k-1} = A_k^{-1} \cdots A_1^{-1} b.
\end{aligned}
$$

The $i$th step of this process requires computing $z_i = A_i^{-1} z_{i-1}$, i.e., solving the
linear equations $A_i z_i = z_{i-1}$. If each of these equations is easy to solve (e.g., if $A_i$
is diagonal, lower or upper triangular, a permutation, etc.), this gives a method for
computing $x = A^{-1} b$.
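
The right-to-left recursion can be sketched as a loop over per-factor solvers. This is only an illustration: the names `factor_solve`, `diagonal_solver`, and `permutation_solver` are hypothetical, and the two factor types shown are chosen for brevity.

```python
def factor_solve(solvers, b):
    """Solve A x = b given A = A_1 A_2 ... A_k.

    solvers[i] is a function mapping z to A_{i+1}^{-1} z, listed in the
    order A_1, ..., A_k, so the loop computes z_1, z_2, ..., x in turn.
    """
    z = list(b)
    for solve in solvers:
        z = solve(z)
    return z


def diagonal_solver(d):
    """Solver for the diagonal factor diag(d)."""
    return lambda z: [zi / di for zi, di in zip(z, d)]


def permutation_solver(pi):
    """Solver for the permutation factor with (A x)_i = x_{pi[i]} (zero-based)."""
    def solve(z):
        x = [0.0] * len(z)
        for i, j in enumerate(pi):
            x[j] = z[i]
        return x
    return solve


# A = A_1 A_2 with A_1 = diag(2, 4) and A_2 the permutation swapping two entries.
x = factor_solve([diagonal_solver([2.0, 4.0]), permutation_solver([1, 0])], [6.0, 4.0])
```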

The step of expressing $A$ in factored form (i.e., computing the factors $A_i$) is
called the factorization step, and the process of computing $x = A^{-1} b$ recursively,
by solving a sequence of problems of the form $A_i z_i = z_{i-1}$, is often called the solve
step. The total flop count for solving $Ax = b$ using this factor-solve method is $f + s$,
where $f$ is the flop count for computing the factorization, and $s$ is the total flop
count for the solve step. In many cases, the cost of the factorization, $f$, dominates
the total solve cost $s$. In this case, the cost of solving $Ax = b$, i.e., computing
$x = A^{-1} b$, is just $f$.


Solving  equations  with  multiple  righthand  sides 

Suppose we need to solve the equations

$$
Ax_1 = b_1, \quad Ax_2 = b_2, \quad \ldots, \quad Ax_m = b_m,
$$

where $A \in \mathbf{R}^{n \times n}$ is nonsingular. In other words, we need to solve $m$ sets of
linear equations, with the same coefficient matrix, but different righthand sides.
Alternatively, we can think of this as computing the matrix

$$
X = A^{-1} B
$$

where

$$
X = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \end{bmatrix} \in \mathbf{R}^{n \times m}, \qquad
B = \begin{bmatrix} b_1 & b_2 & \cdots & b_m \end{bmatrix} \in \mathbf{R}^{n \times m}.
$$

To do this, we first factor $A$, which costs $f$. Then for $i = 1, \ldots, m$ we compute
$A^{-1} b_i$ using the solve step. Since we only factor $A$ once, the total effort is

$$
f + ms.
$$

In other words, we amortize the factorization cost over the set of $m$ solves. Had we
(needlessly) repeated the factorization step for each $i$, the cost would be $m(f + s)$.

When the factorization cost $f$ dominates the solve cost $s$, the factor-solve
method allows us to solve a small number of linear systems, with the same coefficient
matrix, at essentially the same cost as solving one. This is because the
most expensive step, the factorization, is done only once.

We can use the factor-solve method to compute the inverse $A^{-1}$ by solving
$Ax = e_i$ for $i = 1, \ldots, n$, i.e., by computing $A^{-1} I$. This requires one factorization
and $n$ solves, so the cost is $f + ns$.


C.3 LU, Cholesky, and $LDL^T$ factorization

C.3.1 LU factorization

Every nonsingular matrix $A \in \mathbf{R}^{n \times n}$ can be factored as

$$
A = PLU
$$

where $P \in \mathbf{R}^{n \times n}$ is a permutation matrix, $L \in \mathbf{R}^{n \times n}$ is unit lower triangular, and
$U \in \mathbf{R}^{n \times n}$ is upper triangular and nonsingular. This is called the LU factorization
of $A$. We can also write the factorization as $P^T A = LU$, where the matrix $P^T A$ is
obtained from $A$ by re-ordering the rows. The standard algorithm for computing an
LU factorization is called Gaussian elimination with partial pivoting or Gaussian
elimination with row pivoting. The cost is $(2/3)n^3$ flops if no structure in $A$ is
exploited, which is the case we consider first.

Solving  sets  of  linear  equations  using  the  LU  factorization 

The  LU  factorization,  combined  with  the  factor-solve  approach,  is  the  standard 
method  for  solving  a  general  set  of  linear  equations  Ax  =  b. 


Algorithm C.1 Solving linear equations by LU factorization.

given a set of linear equations $Ax = b$, with $A$ nonsingular.

1. LU factorization. Factor $A$ as $A = PLU$ ($(2/3)n^3$ flops).

2. Permutation. Solve $P z_1 = b$ (0 flops).

3. Forward substitution. Solve $L z_2 = z_1$ ($n^2$ flops).

4. Backward substitution. Solve $U x = z_2$ ($n^2$ flops).


The total cost is $(2/3)n^3 + 2n^2$, or $(2/3)n^3$ flops if we keep only the leading term.
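
The four steps of Algorithm C.1 can be sketched in plain Python as follows. This is an unoptimized illustration under stated assumptions (dense storage as lists of rows, the permutation returned as an index list `perm` with row $i$ of $P^T A$ equal to row `perm[i]` of $A$); the function names are hypothetical and a production code would call a library routine instead.

```python
def lu_factor(A):
    """Gaussian elimination with partial pivoting: returns (perm, L, U)."""
    n = len(A)
    U = [[float(v) for v in row] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    perm = list(range(n))
    for k in range(n):
        # Bring the largest entry of column k (rows k..n-1) to the diagonal.
        p = max(range(k, n), key=lambda i: abs(U[i][k]))
        if U[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            U[k], U[p] = U[p], U[k]
            perm[k], perm[p] = perm[p], perm[k]
            for j in range(k):
                L[k][j], L[p][j] = L[p][j], L[k][j]
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return perm, L, U


def lu_solve(perm, L, U, b):
    """Steps 2-4: permutation, forward substitution, backward substitution."""
    n = len(b)
    z = [b[i] for i in perm]                 # z_1 = P^T b, no flops
    for i in range(n):                       # L z_2 = z_1 (unit diagonal)
        for j in range(i):
            z[i] -= L[i][j] * z[j]
    x = z[:]
    for i in reversed(range(n)):             # U x = z_2
        for j in range(i + 1, n):
            x[i] -= U[i][j] * x[j]
        x[i] /= U[i][i]
    return x


A = [[0.0, 2.0, 1.0], [1.0, 1.0, 0.0], [2.0, 0.0, 3.0]]
factors = lu_factor(A)                       # done once
x1 = lu_solve(*factors, [7.0, 3.0, 11.0])    # each extra righthand side
x2 = lu_solve(*factors, [1.0, 0.0, 0.0])     # reuses the same factors
```

Keeping `lu_factor` and `lu_solve` separate mirrors the factor-solve split: additional righthand sides only repeat the cheaper solve step.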

If we need to solve multiple sets of linear equations with different righthand
sides, i.e., $A x_i = b_i$, $i = 1, \ldots, m$, the cost is

$$
(2/3)n^3 + 2mn^2,
$$

since we factor $A$ once, and carry out $m$ pairs of forward and backward substitutions.
For example, we can solve two sets of linear equations, with the same
coefficient matrix but different righthand sides, at essentially the same cost as
solving one. We can compute the inverse $A^{-1}$ by solving the equations $A x_i = e_i$,
where $x_i$ is the $i$th column of $A^{-1}$, and $e_i$ is the $i$th unit vector. This costs $(8/3)n^3$,
i.e., about $3n^3$ flops.

If the matrix $A$ has certain structure, for example banded or sparse, the LU factorization
can be computed in less than $(2/3)n^3$ flops, and the associated forward
and backward substitutions can also be carried out more efficiently.

LU factorization of banded matrices

Suppose the matrix $A \in \mathbf{R}^{n \times n}$ is banded, i.e., $a_{ij} = 0$ if $|i - j| > k$, where
$k < n - 1$ is called the bandwidth of $A$. We are interested in the case where $k \ll n$,
i.e., the bandwidth is much smaller than the size of the matrix. In this case an
LU factorization of $A$ can be computed in roughly $4nk^2$ flops. The resulting upper
triangular matrix $U$ has bandwidth at most $2k$, and the lower triangular matrix $L$
has at most $k + 1$ nonzeros per column, so the forward and back substitutions can
be carried out in order $6nk$ flops. Therefore if $A$ is banded, the linear equations
$Ax = b$ can be solved in about $4nk^2$ flops.

LU  factorization  of  sparse  matrices 

When  the  matrix  A  is  sparse,  the  LU  factorization  usually  includes  both  row  and 
column  permutations,  i.e.,  A  is  factored  as 


$$
A = P_1 L U P_2,
$$

where $P_1$ and $P_2$ are permutation matrices, $L$ is lower triangular, and $U$ is upper
triangular. If the factors $L$ and $U$ are sparse, the forward and backward substitutions
can be carried out efficiently, and we have an efficient method for solving
$Ax = b$. The sparsity of the factors $L$ and $U$ depends on the permutations $P_1$ and
$P_2$, which are chosen in part to yield relatively sparse factors.

The  cost  of  computing  the  sparse  LU  factorization  depends  in  a  complicated 
way  on  the  size  of  A,  the  number  of  nonzero  elements,  its  sparsity  pattern,  and 
the  particular  algorithm  used,  but  is  often  dramatically  smaller  than  the  cost  of  a 
dense  LU  factorization.  In  many  cases  the  cost  grows  approximately  linearly  with 
n,  when  n  is  large.  This  means  that  when  A  is  sparse,  we  can  solve  Ax  =  b  very 
efficiently,  often  with  an  order  approximately  n. 


C.3.2 Cholesky factorization

If $A \in \mathbf{R}^{n \times n}$ is symmetric and positive definite, then it can be factored as

$$
A = LL^T
$$

where $L$ is lower triangular and nonsingular with positive diagonal elements. This
is called the Cholesky factorization of $A$, and can be interpreted as a symmetric
LU factorization (with $L = U^T$). The matrix $L$, which is uniquely determined
by $A$, is called the Cholesky factor of $A$. The cost of computing the Cholesky
factorization of a dense matrix, i.e., without exploiting any structure, is $(1/3)n^3$
flops, half the cost of an LU factorization.

Solving  positive  definite  sets  of  equations  using  Cholesky  factorization 

The Cholesky factorization can be used to solve $Ax = b$ when $A$ is symmetric
positive definite.

Algorithm C.2 Solving linear equations by Cholesky factorization.

given a set of linear equations $Ax = b$, with $A \in \mathbf{S}^n_{++}$.

1. Cholesky factorization. Factor $A$ as $A = LL^T$ ($(1/3)n^3$ flops).

2. Forward substitution. Solve $L z_1 = b$ ($n^2$ flops).

3. Backward substitution. Solve $L^T x = z_1$ ($n^2$ flops).


The total cost is $(1/3)n^3 + 2n^2$, or roughly $(1/3)n^3$ flops.
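
A plain-Python sketch of Algorithm C.2 is given below, assuming dense storage as a list of rows. The column-by-column formulas used in `cholesky` are the standard ones for a dense matrix; the function names are illustrative, and no attempt is made to exploit structure.

```python
import math


def cholesky(A):
    """Return lower triangular L with positive diagonal such that A = L L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L


def cholesky_solve(L, b):
    """Solve L L^T x = b by forward, then backward substitution."""
    n = len(b)
    z = list(b)
    for i in range(n):                       # L z_1 = b
        z[i] = (z[i] - sum(L[i][j] * z[j] for j in range(i))) / L[i][i]
    x = z[:]
    for i in reversed(range(n)):             # L^T x = z_1
        x[i] = (x[i] - sum(L[j][i] * x[j] for j in range(i + 1, n))) / L[i][i]
    return x


A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
L = cholesky(A)
x = cholesky_solve(L, [14.0, 21.0, 26.0])
```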

There are specialized algorithms, with a complexity much lower than $(1/3)n^3$,
for Cholesky factorization of banded and sparse matrices.

Cholesky  factorization  of  banded  matrices 

If $A$ is symmetric positive definite and banded with bandwidth $k$, then its Cholesky
factor $L$ is banded with bandwidth $k$, and can be calculated in $nk^2$ flops. The cost
of the associated solve step is $4nk$ flops.

Cholesky  factorization  of  sparse  matrices 

When  A  is  symmetric  positive  definite  and  sparse,  it  is  usually  factored  as 

$$
A = PLL^TP^T,
$$

where $P$ is a permutation matrix and $L$ is lower triangular with positive diagonal
elements. We can also express this as $P^T A P = LL^T$, i.e., $LL^T$ is the Cholesky
factorization of $P^T A P$. We can interpret this as first re-ordering the variables and
equations, and then forming the (standard) Cholesky factorization of the resulting
permuted matrix. Since $P^T A P$ is positive definite for any permutation matrix $P$,
we  are  free  to  choose  any  permutation  matrix;  for  each  choice  there  is  a  unique 
associated  Cholesky  factor  L.  The  choice  of  P,  however,  can  greatly  affect  the 
sparsity  of  the  factor  L ,  which  in  turn  can  greatly  affect  the  efficiency  of  solving 
Ax  =  b.  Various  heuristic  methods  are  used  to  select  a  permutation  P  that  leads 
to  a  sparse  factor  L. 


Example C.1 Cholesky factorization with an arrow sparsity pattern. Consider a
sparse matrix of the form

$$
A = \begin{bmatrix} 1 & u^T \\ u & D \end{bmatrix}
$$

where $D \in \mathbf{R}^{n \times n}$ is positive diagonal, and $u \in \mathbf{R}^n$. It can be shown that $A$ is positive
definite if $u^T D^{-1} u < 1$. The Cholesky factorization of $A$ is

$$
\begin{bmatrix} 1 & u^T \\ u & D \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ u & L \end{bmatrix}
  \begin{bmatrix} 1 & u^T \\ 0 & L^T \end{bmatrix}
\tag{C.2}
$$

where $L$ is lower triangular with $LL^T = D - uu^T$. For general $u$, the matrix $D - uu^T$
is dense, so we can expect $L$ to be dense. Although the matrix $A$ is very sparse
(most of its rows have just two nonzero elements), its Cholesky factors are almost
completely dense.

On the other hand, suppose we permute the first row and column of $A$ to the end.
After this re-ordering, we obtain the Cholesky factorization

$$
\begin{bmatrix} D & u \\ u^T & 1 \end{bmatrix}
= \begin{bmatrix} D^{1/2} & 0 \\ u^T D^{-1/2} & \sqrt{1 - u^T D^{-1} u} \end{bmatrix}
  \begin{bmatrix} D^{1/2} & D^{-1/2} u \\ 0 & \sqrt{1 - u^T D^{-1} u} \end{bmatrix}.
$$

Now the Cholesky factor has a diagonal 1,1 block, so it is very sparse.

This example illustrates that the re-ordering greatly affects the sparsity of the Cholesky
factors. Here it was quite obvious what the best permutation is, and all good
re-ordering heuristics would select this re-ordering and permute the dense row and
column to the end. For more complicated sparsity patterns, it can be very difficult
to find the 'best' re-ordering (i.e., resulting in the greatest number of zero elements
in $L$), but various heuristics provide good suboptimal permutations.
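
The fill-in effect of the two orderings can be observed numerically. The sketch below uses a dense Cholesky routine and simply counts nonzeros in the computed factor; the helper names and the specific data ($n = 4$, $D = 5I$, $u = \mathbf{1}$, so that $u^T D^{-1} u = 0.8 < 1$) are assumptions chosen for the illustration.

```python
import math


def cholesky(A):
    """Dense Cholesky factorization: returns lower triangular L with A = L L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L


def arrow_matrix(u, d, dense_first=True):
    """Arrow matrix with corner entry 1, border u, and diagonal block diag(d)."""
    n = len(u)
    diag_rows = [[d[i] if j == i else 0.0 for j in range(n)] for i in range(n)]
    if dense_first:                              # [[1, u^T], [u, D]]
        return [[1.0] + list(u)] + [[u[i]] + diag_rows[i] for i in range(n)]
    return [diag_rows[i] + [u[i]] for i in range(n)] + [list(u) + [1.0]]


def count_nonzeros(L, tol=1e-12):
    return sum(1 for row in L for entry in row if abs(entry) > tol)


u, d = [1.0] * 4, [5.0] * 4
nnz_dense_first = count_nonzeros(cholesky(arrow_matrix(u, d, dense_first=True)))
nnz_dense_last = count_nonzeros(cholesky(arrow_matrix(u, d, dense_first=False)))
```

With the dense row and column first, the factor is a full lower triangle; with them last, only the diagonal and the last row are nonzero.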


For the sparse Cholesky factorization, the re-ordering permutation $P$ is often
determined using only the sparsity pattern of the matrix $A$, and not the particular
numerical  values  of  the  nonzero  elements  of  A.  Once  P  is  chosen,  we  can  also 
determine  the  sparsity  pattern  of  L  without  knowing  the  numerical  values  of  the 
nonzero  entries  of  A.  These  two  steps  combined  are  called  the  symbolic  factorization 
of  A,  and  form  the  first  step  in  a  sparse  Cholesky  factorization.  In  contrast,  the 
permutation  matrices  in  a  sparse  LU  factorization  do  depend  on  the  numerical 
values  in  A,  in  addition  to  its  sparsity  pattern. 

The symbolic factorization is then followed by the numerical factorization, i.e.,
the calculation of the nonzero elements of L. Software packages for sparse Cholesky
factorization often include separate routines for the symbolic and the numerical
factorization. This is useful in many applications, because the cost of the symbolic
factorization is significant, and often comparable to the numerical factorization.
Suppose, for example, that we need to solve m sets of linear equations

$$A_1 x = b_1, \quad A_2 x = b_2, \quad \ldots, \quad A_m x = b_m,$$

where the matrices $A_i$ are symmetric positive definite, with different numerical
values, but the same sparsity pattern. Suppose the cost of a symbolic factorization
is $f_{\mathrm{symb}}$, the cost of a numerical factorization is $f_{\mathrm{num}}$, and the cost of the solve step
is s. Then we can solve the m sets of linear equations in

$$f_{\mathrm{symb}} + m(f_{\mathrm{num}} + s)$$

flops, since we only need to carry out the symbolic factorization once, for all m sets
of equations. If instead we carry out a separate symbolic factorization for each set
of linear equations, the flop count is $m(f_{\mathrm{symb}} + f_{\mathrm{num}} + s)$.
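As a small worked illustration of the two flop counts above, the sketch below evaluates both expressions. The function names and the numerical values of m, $f_{\mathrm{symb}}$, $f_{\mathrm{num}}$, and s are hypothetical and chosen only to make the arithmetic concrete.

```python
def cost_shared_symbolic(m, f_symb, f_num, s):
    """One symbolic factorization, then m numerical factorizations and solves."""
    return f_symb + m * (f_num + s)

def cost_separate_symbolic(m, f_symb, f_num, s):
    """A fresh symbolic factorization for each of the m systems."""
    return m * (f_symb + f_num + s)

# Hypothetical costs, in flops.
m, f_symb, f_num, s = 50, 4000, 6000, 500

shared = cost_shared_symbolic(m, f_symb, f_num, s)
separate = cost_separate_symbolic(m, f_symb, f_num, s)
savings = separate - shared  # algebraically equal to (m - 1) * f_symb
print(shared, separate, savings)
```

The difference between the two counts is $(m - 1) f_{\mathrm{symb}}$, so reusing the symbolic factorization matters most when m is large and the symbolic step is comparable in cost to the numerical one.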


C.3.3 $LDL^T$ factorization

Every nonsingular symmetric matrix A can be factored as

$$A = PLDL^TP^T,$$

where P is a permutation matrix, L is lower triangular with positive diagonal
elements, and D is block diagonal, with nonsingular $1 \times 1$ and $2 \times 2$ diagonal
blocks. This is called an $LDL^T$ factorization of A. (The Cholesky factorization
can be considered a special case of $LDL^T$ factorization, with P = I and D = I.)
An $LDL^T$ factorization can be computed in $(1/3)n^3$ flops, if no structure of A is
exploited.

Algorithm C.3 Solving linear equations by $LDL^T$ factorization.

given a set of linear equations $Ax = b$, with $A \in \mathbf{S}^n$ nonsingular.

1. $LDL^T$ factorization. Factor A as $A = PLDL^TP^T$ ($(1/3)n^3$ flops).

2. Permutation. Solve $Pz_1 = b$ (0 flops).

3. Forward substitution. Solve $Lz_2 = z_1$ ($n^2$ flops).

4. (Block) diagonal solve. Solve $Dz_3 = z_2$ (order n flops).

5. Backward substitution. Solve $L^Tz_4 = z_3$ ($n^2$ flops).

6. Permutation. Solve $P^Tx = z_4$ (0 flops).

The total cost is, keeping only the dominant term, $(1/3)n^3$ flops.

$LDL^T$ factorization of banded and sparse matrices

As with the LU and Cholesky factorizations, there are specialized methods for
calculating the $LDL^T$ factorization of a sparse or banded matrix. These are similar
to the analogous methods for Cholesky factorization, with the additional factor D.
In a sparse $LDL^T$ factorization, the permutation matrix P cannot be chosen only
on the basis of the sparsity pattern of A (as in a sparse Cholesky factorization); it
also depends on the particular nonzero values in the matrix A.


C.4  Block  elimination  and  Schur  complements 

C.4.1  Eliminating  a  block  of  variables 


In  this  section  we  describe  a  general  method  that  can  be  used  to  solve  Ax  =  b 
by  first  eliminating  a  subset  of  the  variables,  and  then  solving  a  smaller  system 
of  linear  equations  for  the  remaining  variables.  For  a  dense  unstructured  matrix, 
this  approach  gives  no  advantage.  But  when  the  submatrix  of  A  associated  with 
the  eliminated  variables  is  easily  factored  (for  example,  if  it  is  block  diagonal  or 
banded)  the  method  can  be  substantially  more  efficient  than  a  general  method. 

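Suppose we partition the variable $x \in \mathbf{R}^n$ into two blocks or subvectors,

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},$$

where $x_1 \in \mathbf{R}^{n_1}$, $x_2 \in \mathbf{R}^{n_2}$. We conformally partition the linear equations $Ax = b$
as

$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \tag{C.3}$$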
where $A_{11} \in \mathbf{R}^{n_1 \times n_1}$, $A_{22} \in \mathbf{R}^{n_2 \times n_2}$. Assuming that the submatrix $A_{11}$ is invertible,
we can eliminate $x_1$ from the equations, as follows. Using the first equation,
we can express $x_1$ in terms of $x_2$:

$$x_1 = A_{11}^{-1}(b_1 - A_{12}x_2). \tag{C.4}$$

Substituting this expression into the second equation yields

$$(A_{22} - A_{21}A_{11}^{-1}A_{12})x_2 = b_2 - A_{21}A_{11}^{-1}b_1. \tag{C.5}$$

We refer to this as the reduced equation obtained by eliminating $x_1$ from the original
equation. The reduced equation (C.5) and the equation (C.4) together are
equivalent to the original equations (C.3). The matrix appearing in the reduced
equation is called the Schur complement of the first block $A_{11}$ in A:

$$S = A_{22} - A_{21}A_{11}^{-1}A_{12}$$

(see also §A.5.5). The Schur complement S is nonsingular if and only if A is
nonsingular.

The two equations (C.5) and (C.4) give us an alternative approach to solving
the original system of equations (C.3). We first form the Schur complement S, then
find $x_2$ by solving (C.5), and then calculate $x_1$ from (C.4). We can summarize this
method as follows.


Algorithm C.4 Solving linear equations by block elimination.

given a nonsingular set of linear equations (C.3), with $A_{11}$ nonsingular.

1. Form $A_{11}^{-1}A_{12}$ and $A_{11}^{-1}b_1$.

2. Form $S = A_{22} - A_{21}A_{11}^{-1}A_{12}$ and $\tilde b = b_2 - A_{21}A_{11}^{-1}b_1$.

3. Determine $x_2$ by solving $Sx_2 = \tilde b$.

4. Determine $x_1$ by solving $A_{11}x_1 = b_1 - A_{12}x_2$.
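The four steps of Algorithm C.4 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the function name `block_elimination_solve` is hypothetical, and the dense routine `np.linalg.solve` stands in for whatever structure-exploiting factorization of $A_{11}$ would be used in practice.

```python
import numpy as np

def block_elimination_solve(A11, A12, A21, A22, b1, b2):
    """Solve [[A11, A12], [A21, A22]] [x1; x2] = [b1; b2] by eliminating x1."""
    # Step 1: form A11^{-1} A12 and A11^{-1} b1 with one multi-right-hand-side solve.
    sol = np.linalg.solve(A11, np.column_stack([A12, b1]))
    A11inv_A12, A11inv_b1 = sol[:, :-1], sol[:, -1]
    # Step 2: Schur complement S and reduced right-hand side b_tilde.
    S = A22 - A21 @ A11inv_A12
    b_tilde = b2 - A21 @ A11inv_b1
    # Step 3: solve the reduced equation S x2 = b_tilde.
    x2 = np.linalg.solve(S, b_tilde)
    # Step 4: x1 = A11^{-1}(b1 - A12 x2), reusing the products from step 1.
    x1 = A11inv_b1 - A11inv_A12 @ x2
    return x1, x2
```

Step 4 reuses the quantities computed in step 1 instead of solving with $A_{11}$ again; in exact arithmetic this gives the same $x_1$ as solving $A_{11}x_1 = b_1 - A_{12}x_2$.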


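Remark C.1 Interpretation as block factor-solve. Block elimination can be interpreted
in terms of the factor-solve approach described in §C.2.2, based on the factorization

$$
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
=
\begin{bmatrix} A_{11} & 0 \\ A_{21} & S \end{bmatrix}
\begin{bmatrix} I & A_{11}^{-1}A_{12} \\ 0 & I \end{bmatrix},
$$

which can be considered a block LU factorization. This block LU factorization suggests
the following method for solving (C.3). We first do a 'block forward substitution'
to solve

$$
\begin{bmatrix} A_{11} & 0 \\ A_{21} & S \end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix},
$$

and then solve

$$
\begin{bmatrix} I & A_{11}^{-1}A_{12} \\ 0 & I \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}
$$

by 'block backward substitution'. This yields the same expressions as the block
elimination method:

$$
\begin{aligned}
z_1 &= A_{11}^{-1}b_1 \\
z_2 &= S^{-1}(b_2 - A_{21}z_1) \\
x_2 &= z_2 \\
x_1 &= z_1 - A_{11}^{-1}A_{12}z_2.
\end{aligned}
$$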

In  fact,  the  modern  approach  to  the  factor-solve  method  is  based  on  block  factor 
and  solve  steps  like  these,  with  the  block  sizes  optimally  chosen  for  the  processor  (or 
processors),  cache  sizes,  etc. 


Complexity  analysis  of  block  elimination  method 

To analyze the (possible) advantage of solving the set of linear equations using
block elimination, we carry out a flop count. We let f and s denote the cost of
factoring $A_{11}$ and carrying out the associated solve step, respectively. To keep the
analysis simple we assume (for now) that $A_{12}$, $A_{22}$, and $A_{21}$ are treated as dense,
unstructured matrices. The flop counts for each of the four steps in solving Ax = b
using block elimination are:

1. Computing $A_{11}^{-1}A_{12}$ and $A_{11}^{-1}b_1$ requires factoring $A_{11}$ and $n_2 + 1$ solves, so
it costs $f + (n_2 + 1)s$, or just $f + n_2 s$, dropping the dominated term s.

2. Forming the Schur complement S requires the matrix multiply $A_{21}(A_{11}^{-1}A_{12})$,
which costs $2n_2^2 n_1$, and an $n_2 \times n_2$ matrix subtraction, which costs $n_2^2$ (and
can be dropped). The cost of forming $\tilde b = b_2 - A_{21}A_{11}^{-1}b_1$ is dominated by the
cost of forming S, and so can be ignored. The total cost of step 2, ignoring
dominated terms, is then $2n_2^2 n_1$.

3. To compute $x_2 = S^{-1}\tilde b$, we factor S and solve, which costs $(2/3)n_2^3$.

4. Forming $b_1 - A_{12}x_2$ costs $2n_1n_2 + n_1$ flops. To compute $x_1 = A_{11}^{-1}(b_1 - A_{12}x_2)$,
we can use the factorization of $A_{11}$ already computed in step 1, so only the
solve is necessary, which costs s. Both of these costs are dominated by other
terms, and can be ignored.

The total cost is then

$$f + n_2 s + 2n_2^2 n_1 + (2/3)n_2^3 \tag{C.6}$$

flops.

Eliminating  an  unstructured  matrix 

We first consider the case when no structure in $A_{11}$ is exploited. We factor $A_{11}$
using a standard LU factorization, so $f = (2/3)n_1^3$, and then solve using a forward
and a backward substitution, so $s = 2n_1^2$. The flop count for solving the equations
via block elimination is then

$$(2/3)n_1^3 + n_2(2n_1^2) + 2n_2^2 n_1 + (2/3)n_2^3 = (2/3)(n_1 + n_2)^3,$$

which is the same as just solving the larger set of equations using a standard LU
factorization. In other words, solving a set of equations by block elimination gives
no advantage when no structure of $A_{11}$ is exploited.

On the other hand, when the structure of $A_{11}$ allows us to factor and solve
more efficiently than the standard method, block elimination can be more efficient
than applying the standard method.

Eliminating  a  diagonal  matrix 

If $A_{11}$ is diagonal, no factorization is needed, and we can carry out a solve in $n_1$
flops, so we have f = 0 and $s = n_1$. Substituting these values into (C.6) and
keeping only the leading terms yields

$$2n_2^2 n_1 + (2/3)n_2^3$$

flops, which is far smaller than $(2/3)(n_1 + n_2)^3$, the cost using the standard method.
In particular, the flop count of the standard method grows cubically in $n_1$, whereas
for block elimination the flop count grows only linearly in $n_1$.
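The flop count (C.6) is easy to evaluate for specific sizes. The sketch below is only an illustration: the function names and the sizes $n_1 = 10000$, $n_2 = 100$ are hypothetical. It checks that the unstructured case reproduces the standard LU count and compares it with the diagonal case.

```python
import math

def block_elimination_flops(f, s, n1, n2):
    """Leading-order flop count (C.6): f + n2*s + 2*n2^2*n1 + (2/3)*n2^3."""
    return f + n2 * s + 2 * n2**2 * n1 + (2 / 3) * n2**3

def standard_lu_flops(n):
    """Leading-order cost of a dense LU factorization of an n x n matrix."""
    return (2 / 3) * n**3

n1, n2 = 10_000, 100  # hypothetical sizes with n1 much larger than n2

# No structure in A11: f = (2/3) n1^3, s = 2 n1^2.
unstructured = block_elimination_flops((2 / 3) * n1**3, 2 * n1**2, n1, n2)
# Diagonal A11: f = 0, s = n1.
diagonal = block_elimination_flops(0, n1, n1, n2)
standard = standard_lu_flops(n1 + n2)

matches_standard = math.isclose(unstructured, standard, rel_tol=1e-12)
print(matches_standard, standard / diagonal)
```

For these sizes the diagonal case needs more than a thousand times fewer flops than the standard method, while the unstructured case gives no savings at all.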

Eliminating  a  banded  matrix 

If $A_{11}$ is banded with bandwidth k, we can carry out the factorization in about
$f = 4k^2 n_1$ flops, and the solve can be done in about $s = 6kn_1$ flops. The overall
complexity of solving Ax = b using block elimination is

$$4k^2 n_1 + 6n_2 k n_1 + 2n_2^2 n_1 + (2/3)n_2^3$$

flops. Assuming k is small compared to $n_1$ and $n_2$, this simplifies to $2n_2^2 n_1 + (2/3)n_2^3$,
the same as when $A_{11}$ is diagonal. In particular, the complexity grows linearly in
$n_1$, as opposed to cubically in $n_1$ for the standard method.

A matrix for which $A_{11}$ is banded is sometimes called an arrow matrix since the
sparsity pattern, when $n_1 \gg n_2$, looks like an arrow pointing down and right. Block
elimination can solve linear equations with arrow structure far more efficiently than
the standard method.

Eliminating  a  block  diagonal  matrix 

Suppose that $A_{11}$ is block diagonal, with (square) block sizes $m_1, \ldots, m_k$, where
$n_1 = m_1 + \cdots + m_k$. In this case we can factor $A_{11}$ by factoring each block
separately, and similarly we can carry out the solve step on each block separately.
Using standard methods for these we find

$$f = (2/3)m_1^3 + \cdots + (2/3)m_k^3, \qquad s = 2m_1^2 + \cdots + 2m_k^2,$$

so the overall complexity of block elimination is

$$(2/3)\sum_{i=1}^{k} m_i^3 + 2n_2\sum_{i=1}^{k} m_i^2 + 2n_2^2\sum_{i=1}^{k} m_i + (2/3)n_2^3.$$

If the block sizes are small compared to $n_1$ and $n_1 \gg n_2$, the savings obtained by
block elimination is dramatic.

The linear equations Ax = b, where $A_{11}$ is block diagonal, are called partially
separable for the following reason. If the subvector $x_2$ is fixed, the remaining
equations decouple into k sets of independent linear equations (which can be solved
separately). The subvector $x_2$ is sometimes called the complicating variable since
the equations are much simpler when $x_2$ is fixed. Using block elimination, we
can solve partially separable linear equations far more efficiently than by using a
standard method.

Eliminating  a  sparse  matrix 

If $A_{11}$ is sparse, we can eliminate $A_{11}$ using a sparse factorization and sparse solve
steps, so the values of f and s in (C.6) are much less than for unstructured $A_{11}$.
When $A_{11}$ in (C.3) is sparse and the other blocks are dense, and $n_2 \ll n_1$, we
say that A is a sparse matrix with a few dense rows and columns. Eliminating
the sparse block $A_{11}$ provides an efficient method for solving equations which are
sparse except for a few dense rows and columns.

An  alternative  is  to  simply  apply  a  sparse  factorization  algorithm  to  the  entire 
matrix  A.  Most  sparse  solvers  will  handle  dense  rows  and  columns,  and  select  a 
permutation  that  results  in  sparse  factors,  and  hence  fast  factorization  and  solve 
times.  This  is  more  straightforward  than  using  block  elimination,  but  often  slower, 
especially  in  applications  where  we  can  exploit  structure  in  the  other  blocks  (see, 
e.g.,  example  C.4). 


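Remark C.2 As already suggested in remark C.1, these two methods for solving systems
with a few dense rows and columns are closely related. Applying the elimination
method by factoring $A_{11}$ and S as

$$A_{11} = P_1L_1U_1P_2, \qquad S = P_3L_2U_2,$$

can be interpreted as factoring A as

$$
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
=
\begin{bmatrix} P_1 & 0 \\ 0 & P_3 \end{bmatrix}
\begin{bmatrix} L_1 & 0 \\ P_3^TA_{21}P_2^TU_1^{-1} & L_2 \end{bmatrix}
\begin{bmatrix} U_1 & L_1^{-1}P_1^TA_{12} \\ 0 & U_2 \end{bmatrix}
\begin{bmatrix} P_2 & 0 \\ 0 & I \end{bmatrix},
$$

followed by forward and backward substitutions.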


C.4.2 Block elimination and structure

Symmetry and positive definiteness

There are variants of the block elimination method that can be used when A is
symmetric, or symmetric and positive definite. When A is symmetric, so are $A_{11}$
and the Schur complement S, so a symmetric factorization can be used for $A_{11}$
and S. Symmetry can also be exploited in the other operations, such as the matrix
multiplies. Overall the savings over the nonsymmetric case is around a factor of
two.

Positive definiteness can also be exploited in block elimination. When A is symmetric
and positive definite, so are $A_{11}$ and the Schur complement S, so Cholesky
factorizations can be used.

Exploiting  structure  in  other  blocks 

Our complexity analysis above assumes that we exploit no structure in the matrices
$A_{12}$, $A_{21}$, $A_{22}$, and the Schur complement S, i.e., they are treated as dense. But in
many cases there is structure in these blocks that can be exploited in forming the
Schur complement, factoring it, and carrying out the solve steps. In such cases the
computational savings of the block elimination method over a standard method
can be even higher.


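Example C.2 Block triangular equations. Suppose that $A_{12} = 0$, i.e., the linear
equations Ax = b have block lower triangular structure:

$$
\begin{bmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}.
$$

In this case the Schur complement is just $S = A_{22}$, and the block elimination method
reduces to block forward substitution:

$$
\begin{aligned}
x_1 &:= A_{11}^{-1}b_1 \\
x_2 &:= A_{22}^{-1}(b_2 - A_{21}x_1).
\end{aligned}
$$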


Example C.3 Block diagonal and banded systems. Suppose that $A_{11}$ is block diagonal,
with maximum block size $l \times l$, and that $A_{12}$, $A_{21}$, and $A_{22}$ are banded, say with
bandwidth k. In this case, $A_{11}^{-1}$ is also block diagonal, with the same block sizes as
$A_{11}$. Therefore the product $A_{11}^{-1}A_{12}$ is also banded, with bandwidth k + l, and the
Schur complement, $S = A_{22} - A_{21}A_{11}^{-1}A_{12}$, is banded with bandwidth 2k + l. This
means that forming the Schur complement S can be done more efficiently, and that
the factorization and solve steps with S can be done efficiently. In particular, for
fixed maximum block size l and bandwidth k, we can solve Ax = b with a number of
flops that grows linearly with n.


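Example C.4 KKT structure. Suppose that the matrix A has KKT structure, i.e.,

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{12}^T & 0 \end{bmatrix},$$

where $A_{11} \in \mathbf{S}^p_{++}$, and $A_{12} \in \mathbf{R}^{p \times m}$ with $\operatorname{rank} A_{12} = m$. Since $A_{11} \succ 0$, we can
use a Cholesky factorization. The Schur complement $S = -A_{12}^TA_{11}^{-1}A_{12}$ is negative
definite, so we can factor $-S$ using a Cholesky factorization.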
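A KKT system of this form can be solved with two Cholesky factorizations, as the example suggests. The NumPy sketch below is one way to organize the computation, under stated assumptions: the function name `solve_kkt` is hypothetical, and since NumPy has no dedicated triangular solver, `np.linalg.solve` is applied to the triangular factors (a triangular routine such as SciPy's `solve_triangular` would exploit the structure).

```python
import numpy as np

def solve_kkt(A11, A12, b1, b2):
    """Solve [[A11, A12], [A12^T, 0]] [x1; x2] = [b1; b2] with A11 positive definite."""
    L = np.linalg.cholesky(A11)       # A11 = L L^T
    Y = np.linalg.solve(L, A12)       # Y = L^{-1} A12
    y = np.linalg.solve(L, b1)        # y = L^{-1} b1
    neg_S = Y.T @ Y                   # -S = A12^T A11^{-1} A12, positive definite
    G = np.linalg.cholesky(neg_S)     # -S = G G^T
    # Reduced equation S x2 = b2 - A12^T A11^{-1} b1, written as (-S) x2 = Y^T y - b2.
    rhs = Y.T @ y - b2
    x2 = np.linalg.solve(G.T, np.linalg.solve(G, rhs))
    # x1 = A11^{-1}(b1 - A12 x2) = L^{-T}(y - Y x2).
    x1 = np.linalg.solve(L.T, y - Y @ x2)
    return x1, x2
```

The second factorization succeeds because $A_{12}$ has full column rank, which makes $-S$ positive definite.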
C.4.3 The matrix inversion lemma

The  idea  of  block  elimination  is  to  remove  variables,  and  then  solve  a  smaller  set  of 
equations  that  involve  the  Schur  complement  of  the  original  matrix  with  respect  to 
the  eliminated  variables.  The  same  idea  can  be  turned  around:  When  we  recognize 
a  matrix  as  a  Schur  complement,  we  can  introduce  new  variables,  and  create  a 
larger  set  of  equations  to  solve.  In  most  cases  there  is  no  advantage  to  doing  this, 
since  we  end  up  with  a  larger  set  of  equations.  But  when  the  larger  set  of  equations 
has  some  special  structure  that  can  be  exploited  to  solve  it,  introducing  variables 
can  lead  to  an  efficient  method.  The  most  common  case  is  when  another  block  of 
variables  can  be  eliminated  from  the  larger  matrix. 

We start with the linear equations

$$(A + BC)x = b, \tag{C.7}$$

where $A \in \mathbf{R}^{n \times n}$ is nonsingular, and $B \in \mathbf{R}^{n \times p}$, $C \in \mathbf{R}^{p \times n}$. We introduce a new
variable y = Cx, and rewrite the equations as

$$Ax + By = b, \qquad y = Cx,$$

or, in matrix form,

$$
\begin{bmatrix} A & B \\ C & -I \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
=
\begin{bmatrix} b \\ 0 \end{bmatrix}. \tag{C.8}
$$

Note that our original coefficient matrix, A + BC, is the Schur complement of $-I$
in the larger matrix that appears in (C.8). If we were to eliminate the variable y
from (C.8), we would get back the original equation (C.7).

In some cases, it can be more efficient to solve the larger set of equations (C.8)
than the original, smaller set of equations (C.7). This would be the case, for
example, if A, B, and C were relatively sparse, but the matrix A + BC were far
less sparse.

After introducing the new variable y, we can eliminate the original variable x
from the larger set of equations (C.8), using $x = A^{-1}(b - By)$. Substituting this
into the second equation y = Cx, we obtain

$$(I + CA^{-1}B)y = CA^{-1}b,$$

so that

$$y = (I + CA^{-1}B)^{-1}CA^{-1}b.$$

Using $x = A^{-1}(b - By)$, we get

$$x = \left(A^{-1} - A^{-1}B(I + CA^{-1}B)^{-1}CA^{-1}\right)b. \tag{C.9}$$

Since b is arbitrary, we conclude that

$$(A + BC)^{-1} = A^{-1} - A^{-1}B(I + CA^{-1}B)^{-1}CA^{-1}.$$

This is known as the matrix inversion lemma, or the Sherman-Woodbury-Morrison
formula.

The matrix inversion lemma has many applications. For example if p is small
(or even just not very large), it gives us a method for solving (A + BC)x = b,
provided we have an efficient method for solving Au = v.
Diagonal or sparse plus low rank

Suppose that A is diagonal with nonzero diagonal elements, and we want to solve
an equation of the form (C.7). The straightforward solution would consist of first
forming the matrix D = A + BC, and then solving Dx = b. If the product BC
is dense, then the complexity of this method is $2pn^2$ flops to form A + BC, plus
$(2/3)n^3$ flops for the LU factorization of D, so the total cost is

$$2pn^2 + (2/3)n^3$$

flops. The matrix inversion lemma suggests a more efficient method. We can
calculate x by evaluating the expression (C.9) from right to left, as follows. We
first evaluate $z = A^{-1}b$ (n flops, since A is diagonal). Then we form the matrix
$E = I + CA^{-1}B$ ($2p^2n$ flops). Next we solve Ew = Cz, which is a set of p linear
equations in p variables. The cost is $(2/3)p^3$ flops, plus 2pn to form Cz. Finally,
we evaluate $x = z - A^{-1}Bw$ (2pn flops for the matrix-vector product Bw, plus
lower order terms). The total cost is

$$2p^2n + (2/3)p^3$$

flops, dropping dominated terms. Comparing with the first method, we see that
the second method is more efficient when p < n. In particular if p is small and
fixed, the complexity grows linearly with n.
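The right-to-left evaluation of (C.9) for a diagonal A can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `solve_diag_plus_low_rank` is hypothetical, and the diagonal of A is assumed to be passed as a one-dimensional array `a`.

```python
import numpy as np

def solve_diag_plus_low_rank(a, B, C, b):
    """Solve (diag(a) + B C) x = b by evaluating (C.9) from right to left."""
    p = B.shape[1]
    z = b / a                       # z = A^{-1} b
    AinvB = B / a[:, None]          # A^{-1} B: row i of B scaled by 1 / a_i
    E = np.eye(p) + C @ AinvB       # E = I + C A^{-1} B, a p x p matrix
    w = np.linalg.solve(E, C @ z)   # solve E w = C z
    return z - AinvB @ w            # x = z - A^{-1} B w
```

Only the p x p system with E requires a general factorization; every other step is a scaling or a product with the n x p or p x n factors, which is why the cost grows linearly with n for fixed p.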

Another important application of the matrix inversion lemma occurs when A is
sparse and nonsingular, and the matrices B and C are dense. Again we can compare
two methods. The first method is to form the (dense) matrix A + BC, and to
solve (C.7) using a dense LU factorization. The cost of this method is $2pn^2 + (2/3)n^3$
flops. The second method is based on evaluating the expression (C.9), using a
sparse LU factorization of A. Specifically, suppose that f is the cost of factoring
A as $A = P_1LUP_2$, and s is the cost of solving the factored system $P_1LUP_2x = d$.
We can evaluate (C.9) from right to left as follows. We first factor A, and solve
p + 1 linear systems

$$Az = b, \qquad AD = B,$$

to find $z \in \mathbf{R}^n$, and $D \in \mathbf{R}^{n \times p}$. The cost is f + (p + 1)s flops. Next, we form the
matrix E = I + CD, and solve

$$Ew = Cz,$$

which is a set of p linear equations in p variables w. The cost of this step is
$2p^2n + (2/3)p^3$ plus lower order terms. Finally, we evaluate x = z - Dw, at a cost
of 2pn flops. This gives us a total cost of

$$f + ps + 2p^2n + (2/3)p^3$$

flops. If $f \ll (2/3)n^3$ and $s \ll 2n^2$, this is much lower than the complexity of the
first method.


Remark C.3 The augmented system approach. A different approach to exploiting
sparse plus low rank structure is to solve (C.8) directly using a sparse LU-solver. The
system (C.8) is a set of p + n linear equations in p + n variables, and is sometimes
called the augmented system associated with (C.7). If A is very sparse and p is small,
then solving the augmented system using a sparse solver can be much faster than
solving the system (C.7) using a dense solver.

The augmented system approach is closely related to the method that we described
above. Suppose

$$A = P_1LUP_2$$

is a sparse LU factorization of A, and

$$I + CA^{-1}B = P_3\tilde L\tilde U$$

is a dense LU factorization of $I + CA^{-1}B$. Then

$$
\begin{bmatrix} A & B \\ C & -I \end{bmatrix}
=
\begin{bmatrix} P_1 & 0 \\ 0 & P_3 \end{bmatrix}
\begin{bmatrix} L & 0 \\ P_3^TCP_2^TU^{-1} & -\tilde L \end{bmatrix}
\begin{bmatrix} U & L^{-1}P_1^TB \\ 0 & \tilde U \end{bmatrix}
\begin{bmatrix} P_2 & 0 \\ 0 & I \end{bmatrix}, \tag{C.10}
$$

and this factorization can be used to solve the augmented system. It can be verified
that this is equivalent to the method based on the matrix inversion lemma that we
described above.

Of course, if we solve the augmented system using a sparse LU solver, we have no
control over the permutations that are selected. The solver might choose a factorization
different from (C.10), and more expensive to compute. In spite of this, the
augmented system approach remains an attractive option. It is easier to implement
than the method based on the matrix inversion lemma, and it is numerically more
stable.


Low  rank  updates 

Suppose $A \in \mathbf{R}^{n \times n}$ is nonsingular, $u, v \in \mathbf{R}^n$ with $1 + v^TA^{-1}u \neq 0$, and we want
to solve two sets of linear equations

$$Ax = b, \qquad (A + uv^T)\tilde x = b.$$

The solution $\tilde x$ of the second system is called a rank-one update of x. The matrix
inversion lemma allows us to calculate the rank-one update $\tilde x$ very cheaply, once
we have computed x. We have

$$
\begin{aligned}
\tilde x &= (A + uv^T)^{-1}b \\
&= \left(A^{-1} - \frac{1}{1 + v^TA^{-1}u}A^{-1}uv^TA^{-1}\right)b \\
&= x - \frac{v^Tx}{1 + v^TA^{-1}u}A^{-1}u.
\end{aligned}
$$

We can therefore solve both systems by factoring A, computing $x = A^{-1}b$ and
$w = A^{-1}u$, and then evaluating

$$\tilde x = x - \frac{v^Tx}{1 + v^Tw}w.$$

The overall cost is f + 2s, as opposed to 2(f + s) if we were to solve for $\tilde x$ from
scratch.
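The rank-one update can be sketched in NumPy as follows. The function name `rank_one_update_solve` is hypothetical, and the single call to `np.linalg.solve` with two right-hand sides is used here to mimic "factor once, solve twice"; it is an illustration rather than a prescribed implementation.

```python
import numpy as np

def rank_one_update_solve(A, u, v, b):
    """Return x solving A x = b and x_tilde solving (A + u v^T) x_tilde = b."""
    # One factorization of A, two solves: x = A^{-1} b and w = A^{-1} u.
    sol = np.linalg.solve(A, np.column_stack([b, u]))
    x, w = sol[:, 0], sol[:, 1]
    # Rank-one update; assumes 1 + v^T w is nonzero.
    x_tilde = x - (v @ x) / (1.0 + v @ w) * w
    return x, x_tilde
```

After the two solves, the update itself costs only a few inner products and a vector addition, which is order n.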
C.5 Solving underdetermined linear equations

To conclude this appendix, we mention a few important facts about underdetermined
linear equations

$$Ax = b, \tag{C.11}$$

where $A \in \mathbf{R}^{p \times n}$ with p < n. We assume that rank A = p, so there is at least one
solution for all b.

In many applications it is sufficient to find just one particular solution $\hat x$. In
other situations we might need a complete parametrization of all solutions as

$$\{x \mid Ax = b\} = \{Fz + \hat x \mid z \in \mathbf{R}^{n-p}\} \tag{C.12}$$

where F is a matrix whose columns form a basis for the nullspace of A.

Inverting  a  nonsingular  submatrix  of  A 

The solution of the underdetermined system is straightforward if a $p \times p$ nonsingular
submatrix of A is known. We start by assuming that the first p columns of A are
independent. Then we can write the equation Ax = b as

$$
Ax = \begin{bmatrix} A_1 & A_2 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= A_1x_1 + A_2x_2 = b,
$$

where $A_1 \in \mathbf{R}^{p \times p}$ is nonsingular. We can express $x_1$ as

$$x_1 = A_1^{-1}(b - A_2x_2) = A_1^{-1}b - A_1^{-1}A_2x_2.$$

This expression allows us to easily calculate a solution: we simply take $x_2 = 0$,
$x_1 = A_1^{-1}b$. The cost is equal to the cost of solving one square set of p linear
equations $A_1x_1 = b$.

We can also parametrize all solutions of Ax = b, using $x_2 \in \mathbf{R}^{n-p}$ as a free
parameter. The general solution of Ax = b can be expressed as

$$
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= \begin{bmatrix} -A_1^{-1}A_2 \\ I \end{bmatrix} x_2
+ \begin{bmatrix} A_1^{-1}b \\ 0 \end{bmatrix}.
$$

This gives a parametrization of the form (C.12) with

$$
F = \begin{bmatrix} -A_1^{-1}A_2 \\ I \end{bmatrix}, \qquad
\hat x = \begin{bmatrix} A_1^{-1}b \\ 0 \end{bmatrix}.
$$

To summarize, assume that the cost of factoring $A_1$ is f and the cost of solving one
system of the form $A_1x = d$ is s. Then the cost of finding one solution of (C.11)
is f + s. The cost of parametrizing all solutions (i.e., calculating F and $\hat x$) is
f + s(n - p + 1).
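The parametrization above can be computed with one factorization of $A_1$ and n - p + 1 solves. The NumPy sketch below illustrates this under the assumption that the first p columns of A are independent; the function name `parametrize_by_submatrix` is hypothetical.

```python
import numpy as np

def parametrize_by_submatrix(A, b):
    """Return (F, x_hat) with {x | A x = b} = {F z + x_hat}, assuming A[:, :p] is nonsingular."""
    p, n = A.shape
    A1, A2 = A[:, :p], A[:, p:]
    # One factorization of A1, n - p + 1 right-hand sides: b and the columns of A2.
    sol = np.linalg.solve(A1, np.column_stack([b, A2]))
    A1inv_b, A1inv_A2 = sol[:, 0], sol[:, 1:]
    x_hat = np.concatenate([A1inv_b, np.zeros(n - p)])
    F = np.vstack([-A1inv_A2, np.eye(n - p)])
    return F, x_hat
```

Any choice of z then gives a solution `F @ z + x_hat`, and taking z = 0 recovers the particular solution with $x_2 = 0$.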

Now we consider the general case, when the first p columns of A need not be
independent. Since rank A = p, we can select a set of p columns of A that is
independent, permute them to the front, and then apply the method described
above. In other words, we find a permutation matrix P such that the first p
columns of $\tilde A = AP$ are independent, i.e.,

$$\tilde A = AP = \begin{bmatrix} A_1 & A_2 \end{bmatrix},$$

where $A_1$ is invertible. The general solution of $\tilde A\tilde x = b$, where $\tilde x = P^Tx$, is then
given by

$$
\tilde x = \begin{bmatrix} -A_1^{-1}A_2 \\ I \end{bmatrix} z
+ \begin{bmatrix} A_1^{-1}b \\ 0 \end{bmatrix}.
$$

The general solution of Ax = b is then given by

$$
x = P\tilde x = P\begin{bmatrix} -A_1^{-1}A_2 \\ I \end{bmatrix} z
+ P\begin{bmatrix} A_1^{-1}b \\ 0 \end{bmatrix},
$$

where $z \in \mathbf{R}^{n-p}$ is a free parameter. This idea is useful when it is easy to identify
a nonsingular or easily inverted submatrix of A, for example, a diagonal matrix
with nonzero diagonal elements.


The  QR  factorization 

If $C \in \mathbf{R}^{n \times p}$ with p < n and rank C = p, then it can be factored as

$$C = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}\begin{bmatrix} R \\ 0 \end{bmatrix},$$

where $Q_1 \in \mathbf{R}^{n \times p}$ and $Q_2 \in \mathbf{R}^{n \times (n-p)}$ satisfy

$$Q_1^TQ_1 = I, \qquad Q_2^TQ_2 = I, \qquad Q_1^TQ_2 = 0,$$

and $R \in \mathbf{R}^{p \times p}$ is upper triangular with nonzero diagonal elements. This is called
the QR factorization of C. The QR factorization can be calculated in $2p^2(n - p/3)$
flops. (The matrix Q is stored in a factored form that makes it possible to efficiently
compute matrix-vector products $Qx$ and $Q^Tx$.)

The QR factorization can be used to solve the underdetermined set of linear
equations (C.11). Suppose

$$A^T = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix}\begin{bmatrix} R \\ 0 \end{bmatrix}$$

is the QR factorization of $A^T$. Substituting in the equations it is clear that
$\hat x = Q_1R^{-T}b$ satisfies the equations:

$$A\hat x = R^TQ_1^TQ_1R^{-T}b = b.$$

Moreover, the columns of $Q_2$ form a basis for the nullspace of A, so the complete
solution set can be parametrized as

$$\{x = \hat x + Q_2z \mid z \in \mathbf{R}^{n-p}\}.$$

The QR factorization method is the most common method for solving underdetermined
equations. One drawback is that it is difficult to exploit sparsity. The
factor Q is usually dense, even when C is very sparse.
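The QR-based parametrization can be sketched with NumPy's `np.linalg.qr` in `"complete"` mode, which returns the full n x n orthogonal factor. The function name `parametrize_by_qr` is hypothetical, and this sketch forms Q explicitly rather than keeping it in the factored form mentioned above.

```python
import numpy as np

def parametrize_by_qr(A, b):
    """Return (Q2, x_hat) with {x | A x = b} = {x_hat + Q2 z}, for A of full row rank p < n."""
    p, n = A.shape
    Q, R_full = np.linalg.qr(A.T, mode="complete")  # Q is n x n, R_full is n x p
    Q1, Q2 = Q[:, :p], Q[:, p:]
    R = R_full[:p, :]                               # p x p upper triangular block
    x_hat = Q1 @ np.linalg.solve(R.T, b)            # x_hat = Q1 R^{-T} b
    return Q2, x_hat
```

Here `np.linalg.solve(R.T, b)` is a general solve applied to a triangular matrix; a dedicated triangular solver would reduce this step to a forward substitution.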
LU factorization of a rectangular matrix

If $C \in \mathbf{R}^{n \times p}$ with p < n and rank C = p, then it can be factored as

$$C = PLU$$

where $P \in \mathbf{R}^{n \times n}$ is a permutation matrix, $L \in \mathbf{R}^{n \times p}$ is unit lower triangular (i.e.,
$l_{ij} = 0$ for i < j and $l_{ii} = 1$), and $U \in \mathbf{R}^{p \times p}$ is nonsingular and upper triangular.
The cost is $(2/3)p^3 + p^2(n - p)$ flops if no structure in C is exploited.

If the matrix C is sparse, the LU factorization usually includes row and column
permutations, i.e., we factor C as

$$C = P_1LUP_2$$

where $P_1 \in \mathbf{R}^{n \times n}$ and $P_2 \in \mathbf{R}^{p \times p}$ are permutation matrices. The LU factorization of a sparse
rectangular matrix can be calculated very efficiently, at a cost that is much lower
than for dense matrices.

The LU factorization can be used to solve underdetermined sets of linear equations.
Suppose $A^T = PLU$ is the LU factorization of the matrix $A^T$ in (C.11), and
we partition L as

$$L = \begin{bmatrix} L_1 \\ L_2 \end{bmatrix},$$

where $L_1 \in \mathbf{R}^{p \times p}$ and $L_2 \in \mathbf{R}^{(n-p) \times p}$. It is easily verified that the solution set can
be parametrized as (C.12) with

$$
F = P\begin{bmatrix} -L_1^{-T}L_2^T \\ I \end{bmatrix}, \qquad
\hat x = P\begin{bmatrix} L_1^{-T}U^{-T}b \\ 0 \end{bmatrix}.
$$
Notation


Some specific sets

$\mathbf{R}$                                    Real numbers.
$\mathbf{R}^n$                                  Real n-vectors ($n \times 1$ matrices).
$\mathbf{R}^{m \times n}$                       Real $m \times n$ matrices.
$\mathbf{R}_+$, $\mathbf{R}_{++}$               Nonnegative, positive real numbers.
$\mathbf{C}$                                    Complex numbers.
$\mathbf{C}^n$                                  Complex n-vectors.
$\mathbf{C}^{m \times n}$                       Complex $m \times n$ matrices.
$\mathbf{Z}$                                    Integers.
$\mathbf{Z}_+$                                  Nonnegative integers.
$\mathbf{S}^n$                                  Symmetric $n \times n$ matrices.
$\mathbf{S}^n_+$, $\mathbf{S}^n_{++}$           Symmetric positive semidefinite, positive definite, $n \times n$ matrices.

Vectors and matrices

$\mathbf{1}$                                    Vector with all components one.
$e_i$                                           ith standard basis vector.
$I$                                             Identity matrix.
$X^T$                                           Transpose of matrix X.
$X^H$                                           Hermitian (complex conjugate) transpose of matrix X.
$\operatorname{tr} X$                           Trace of matrix X.
$\lambda_i(X)$                                  ith largest eigenvalue of symmetric matrix X.
$\lambda_{\max}(X)$, $\lambda_{\min}(X)$        Maximum, minimum eigenvalue of symmetric matrix X.
$\sigma_i(X)$                                   ith largest singular value of matrix X.
$\sigma_{\max}(X)$, $\sigma_{\min}(X)$          Maximum, minimum singular value of matrix X.
$X^\dagger$                                     Moore-Penrose or pseudo-inverse of matrix X.
$x \perp y$                                     Vectors x and y are orthogonal: $x^Ty = 0$.
$V^\perp$                                       Orthogonal complement of subspace V.
$\operatorname{diag}(x)$                        Diagonal matrix with diagonal entries $x_1, \ldots, x_n$.
$\operatorname{diag}(X, Y, \ldots)$             Block diagonal matrix with diagonal blocks X, Y, ....
$\operatorname{rank} A$                         Rank of matrix A.
$\mathcal{R}(A)$                                Range of matrix A.
$\mathcal{N}(A)$                                Nullspace of matrix A.
Norms and distances

$\| \cdot \|$                                   A norm.
$\| \cdot \|_*$                                 Dual of norm $\| \cdot \|$.
$\|x\|_2$                                       Euclidean (or $\ell_2$-) norm of vector x.
$\|x\|_1$                                       $\ell_1$-norm of vector x.
$\|x\|_\infty$                                  $\ell_\infty$-norm of vector x.
$\|X\|_2$                                       Spectral norm (maximum singular value) of matrix X.
$B(c, r)$                                       Ball with center c and radius r.
$\operatorname{dist}(A, B)$                     Distance between sets (or points) A and B.

Generalized  inequalities 


$x \preceq y$                                   Componentwise inequality between vectors $x$ and $y$.
$x \prec y$                                     Strict componentwise inequality between vectors $x$ and $y$.
$X \preceq Y$                                   Matrix inequality between symmetric matrices $X$ and $Y$.
$X \prec Y$                                     Strict matrix inequality between symmetric matrices $X$ and $Y$.
$x \preceq_K y$                                 Generalized inequality induced by proper cone $K$.
$x \prec_K y$                                   Strict generalized inequality induced by proper cone $K$.
$x \preceq_{K^*} y$                             Dual generalized inequality.
$x \prec_{K^*} y$                               Dual strict generalized inequality.


Topology  and  convex  analysis 


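$\mathbf{card}\, C$                             Cardinality of set $C$.
$\mathbf{int}\, C$                              Interior of set $C$.
$\mathbf{relint}\, C$                           Relative interior of set $C$.
$\mathbf{cl}\, C$                               Closure of set $C$.
$\mathbf{bd}\, C$                               Boundary of set $C$: $\mathbf{bd}\, C = \mathbf{cl}\, C \setminus \mathbf{int}\, C$.
$\mathbf{conv}\, C$                             Convex hull of set $C$.
$\mathbf{aff}\, C$                              Affine hull of set $C$.
$K^*$                                           Dual cone associated with $K$.
$I_C$                                           Indicator function of set $C$.
$S_C$                                           Support function of set $C$.
$f^*$                                           Conjugate function of $f$.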

Probability 

$\mathbf{E}\, X$                                Expected value of random vector $X$.
$\mathbf{prob}\, S$                             Probability of event $S$.
$\mathbf{var}\, X$                              Variance of scalar random variable $X$.
$\mathcal{N}(c, \Sigma)$                        Gaussian distribution with mean $c$, covariance (matrix) $\Sigma$.
$\Phi$                                          Cumulative distribution function of $\mathcal{N}(0, 1)$ random variable.
Functions and derivatives

$f : A \to B$                                   $f$ is a function on the set $\mathbf{dom}\, f \subseteq A$ into the set $B$.
$\mathbf{dom}\, f$                              Domain of function $f$.
$\mathbf{epi}\, f$                              Epigraph of function $f$.
$\nabla f$                                      Gradient of function $f$.
$\nabla^2 f$                                    Hessian of function $f$.
$Df$                                            Derivative (Jacobian) matrix of function $f$.

Index 


A-optimal  experiment  design,  387 

abstract  form  convex  problem,  137 

active  constraint,  128 

activity  planning,  149,  195 

aff  (affine  hull),  23 

affine 

combination,  22 
dimension,  23 
function,  36 

composition,  79,  95,  508,  642,  645 
hull,  23 

independence,  32 
invariance,  486 

analytic  center,  449 
Newton  decrement,  487 
Newton  step,  527 
Newton’s  method,  494,  496 
self-concordance,  498 
set,  21 

separation  from  convex  set,  49 
algorithm,  see  method 
alignment  constraint,  442 
allocation 

asset,  155,  186,  209 
power,  196,  210,  212,  245 
resource,  253,  523,  559 
alternatives,  258,  285 

generalized  inequalities,  54,  269 
linear  discrimination,  423 
linear  inequalities,  50,  63 
linear  matrix  inequality,  287 
nonconvex  quadratic,  655 
strong,  260 
weak,  258 

amplitude  distribution,  294,  304 
analytic  center,  141,  419,  449,  458,  519,  535, 
541,  546,  547 
affine  invariance,  449 
dual,  276,  525 
efficient  line  search,  518 
ellipsoid,  420,  450 

linear  matrix  inequality,  422,  459,  508, 
553 

method,  626 
ML  interpretation,  450 
quadratic  inequalities,  519 


angle,  633 

approximation,  448 
constraint,  406 
problem,  405,  408 
anisotropy,  461 

approximate  Newton  method,  519 
approximation 

Chebyshev,  6,  293 
complex,  197 
fitting  angles,  448 
$\ell_1$-norm, 193, 294
least-squares,  293 
log-Chebyshev,  344 
matrix  norm,  194 
minimax,  293 
monomial,  199 
penalty  function,  294,  353 
regularized,  305 
residual,  291 
robust,  318 
sparse,  333 
total  variation,  312 
variable  bounds,  301 
width,  121 
with  constraints,  301 
arbitrage-free  price,  263 
arithmetic  mean,  75 

arithmetic-geometric  mean  inequality,  78 
arrow  matrix,  670 
asymptotic  cone,  66 

backtracking  line  search,  464 
backward  substitution,  666 
ball,  30,  634 

Euclidean,  29 

banded  matrix,  510,  546,  553,  669,  675 

bandwidth  allocation,  210 

barrier 

cone,  66 
function,  563 
method,  568 

complexity,  585,  595 
convergence  analysis,  577 
convex-concave game, 627
generalized  inequalities,  596,  601,  605 
infeasible  start,  571 


linear  program,  616 
second-order  cone  program,  601 
semidefinite  program,  602,  618 
basis,  405 

dictionary,  333 
dual,  407 
functions,  326 
Lagrange,  326 
over-complete,  333 
pursuit,  310,  333,  580 
well  conditioned,  407 
Bayesian 

classification,  428 
detector,  367 
estimation,  357 
bd  (boundary),  50,  638 
best  linear  unbiased  estimator,  176 
binary  hypothesis  testing,  370 
bisection  method,  249,  430 

quasiconvex  optimization,  146 
BLAS,  684 
block 

elimination,  546,  554,  672 
LU  factorization,  673 
matrix  inverse,  650 
separable,  552 
tridiagonal,  553 
Boolean  linear  program 

Lagrangian  relaxation,  276 
LP  relaxation,  194 
boundary,  638 
bounding  box,  433 
bounds 

Chebyshev,  150,  374 
Chernoff,  379 

convex  function  values,  338 
correlation  coefficients,  408 
expected  values,  361 
for  global  optimization,  11 
probabilities,  361 
box  constraints,  129 

cantilever  beam,  163,  199 

capacity  of  communication  channel,  207 

card  (cardinality),  98 

$\ell_1$-norm heuristic, 310
Cauchy-Schwarz inequality, 633
ceiling,  96 
center 

analytic,  419 
Chebyshev,  148,  416 
maximum  volume  ellipsoid,  418 
central  path,  564 
duality,  565 

generalized  inequalities,  598 
KKT  conditions,  567 
predictor-corrector,  625 


second-order  cone  programming,  599 
semidefinite  programming,  600 
tangent,  624 
certificate 

infeasibility,  259,  582 
suboptimality,  241,  568 
chain  rule,  642 

second  derivative,  645 
change  of  variable,  130 
Chebyshev 

approximation,  6,  293 

lower  bounds  via  least-squares,  274 
robust,  323 
bounds,  150,  374 
center,  148,  416 
inequalities,  150,  154 
norm,  635 

Chernoff  bounds,  379 

Cholesky  factorization,  118,  406,  509,  546, 
617,  669 

banded  matrix,  670 
sparse  matrix,  670 
circuit  design,  2,  17,  432,  446 
cl  (closure),  638 
classification,  422 
Bayesian,  428 
linear,  423 
logistic,  427 
nonlinear,  429 
polynomial,  430 
quadratic,  429 
support  vector,  425 
closed 

function,  458,  529,  577,  639 
set,  637 

sublevel  set  assumption,  457,  529 
closure,  638 
combination 
affine,  22 
conic,  25 
convex,  24 

communication  channel 
capacity,  207 
dual,  279 

power  allocation,  245 
complementary  slackness,  242 
generalized  inequalities,  267 
complex 

norm  approximation,  197 
semidefinite  program,  202 
complexity 

barrier  method,  585 

generalized  inequalities,  605 
linear  equations,  662 
second-order  cone  program,  606 
semidefinite  program,  608 
componentwise inequality, 32, 43
composition,  83 

affine  function,  508,  642,  645 
quasiconvexity,  102 
self-concordance,  499 
concave 

function,  67 

maximization  problem,  137 
cond  (condition  number),  649 
condition  number,  203,  407,  649 
ellipsoid,  461 
gradient  method,  473 
Newton’s  method,  495 
set,  461 

conditional  probability,  42,  357,  428 
cone,  25 

barrier,  66 
dual,  51 
Euclidean,  449 
hyperbolic,  39 
in $\mathbf{R}^2$, 64
lexicographic,  64 
Lorentz,  31 
moments,  66 

monotone  nonnegative,  64 
normal,  66 
pointed,  43 

positive  semidefinite,  34,  64 
program,  168 
dual,  266 
proper,  43 
recession,  66 
second-order,  31,  449 
separation,  66 
solid,  43 

conic 

combination,  25 
form  problem,  168,  201 
hull,  25 
conjugate 

and  Lagrange  dual,  221 
function,  90 

self-concordance,  517 
logarithm,  607 
constraint 

active,  128 
box,  129 
explicit,  134 
hyperbolic,  197 
implicit,  134 
kinematic,  247 
qualifications,  226 
redundant,  128 
set,  127 

consumer  preference,  339 
continuous  function,  639 


control 

model  predictive,  17 
optimal,  194,  303,  552 
conv  (convex  hull),  24 
convergence 

infeasible  Newton  method,  536 
linear,  467 
Newton  method,  529 
quadratic,  489,  539 
convex 

combination,  24 
cone,  25 

equality  constraints,  191 
function,  67 
bounded,  114 
bounding  values,  338 
first-order  conditions,  69 
interpolation,  337 
inverse,  114 
level  set,  113 

over  concave  function,  103 
product,  119 
geometric  program,  162 
hull,  24 

function,  119 
minimizing  over,  207 
optimization,  2,  7,  136 
abstract  form,  137 
set,  23 

image under linear-fractional function, 62

separating  hyperplane,  403 
separation  from  affine  set,  49 
convex-concave,  238 

fractional  problem,  191 
function,  115 

saddle-point  property,  281 
game,  540,  542,  560 
barrier  method,  627 
bounded  inverse  derivative  condition, 
559 

Newton  method,  540 
Newton  step,  559 
convexity 

first-order  condition,  69 
matrix,  110 
midpoint,  60 

second-order  conditions,  71 
strong,  459,  558 
coordinate  projection,  38 
copositive  matrix,  65,  202 
correlation  coefficient,  406 
bounding,  408 
cost,  127 

random,  154 
risk-sensitive,  155 
covariance

estimation,  355 
estimation  error,  384 
incomplete  information,  171 
covering  ellipsoid,  275 
cumulant  generating  function,  106 
cumulative  distribution,  107 
log-concavity,  124 
curve 

minimum  length  piecewise-linear,  547 
optimal  trade-off,  182 
piecewise-arc,  453 

D-optimal  experiment  design,  387 
damped  Newton  step,  489 
data  fitting,  2,  291 
de-noising,  310 

deadzone-linear  penalty  function,  295,  434 
dual,  345 
decomposition 

eigenvalue,  646 
generalized  eigenvalue,  647 
orthogonal,  646 
singular  value,  648 
deconvolution,  307 
degenerate  ellipsoid,  30 
density  function 

log-concave,  104,  124,  352 
depth,  416 
derivative,  640 

chain  rule,  642 
directional,  642 
pricing,  264 
second,  643 
descent 

direction,  463 
feasible,  527 
method,  463 
gradient,  466 
steepest,  475 

design 

circuit,  2,  17,  432,  446 
detector,  364 
of  experiments,  384 
optimal,  292,  303 
detector 

Bayes,  367 
design,  364 
MAP,  369 
minimax,  367 
ML,  369 
randomized,  365 
robust,  372 
determinant,  73 

derivative,  641 
device  sizing,  2 

diagonal  plus  low  rank,  511,  678 


diagonal  scaling,  163 
dictionary,  333 
diet  problem,  148 
differentiable  function,  640 
directional  derivative,  642 
Dirichlet  density,  124 
discrete  memoryless  channel,  207 
discrimination,  422 
dist  (distance),  46,  634 
distance,  46,  634 

between  polyhedra,  154,  403 
between  sets,  402 
constraint,  443 
maximum  probability,  118 
ratio  function,  97 
to  farthest  point  in  set,  81 
to  set,  88,  397 
distribution 

amplitude,  294 
Gaussian,  104 
Laplacian,  352 
maximum  entropy,  362 
Poisson,  353 
Wishart,  105 
dom  (domain),  639 
domain 

function,  639 
problem,  127 

dual 

basis,  407 
cone,  51 

logarithm,  607 
properties,  64 
feasibility  equations,  521 
feasible,  216 
function,  216 

geometric  interpretation,  232 
generalized  inequality,  53 

characterization  of  minimal  points, 
54 

least-squares,  218 
logarithm,  607 
Newton  method,  557 
norm,  93,  637 
problem,  223 
residual,  532 
spectral  norm,  637 
stopping  criterion,  242 
variable,  215 
duality,  215 

central  path,  565 
game  interpretation,  239 
gap,  241 

optimal,  226 
surrogate,  612 

multicriterion  interpretation,  236 
price interpretation, 240
saddle-point  interpretation,  237 
strong,  226 
weak,  225 

dynamic  activity  planning,  149 

E-optimal experiment design, 387
eccentricity, 461
$e_i$ ($i$th unit vector), 33
eigenvalue 

decomposition,  646 
generalized,  647 
interlacing  theorem,  122 
maximum,  82,  203 
optimization,  203 
spread,  203 
sum  of  k  largest,  118 
electronic  device  sizing,  2 
elementary  symmetric  functions,  122 
elimination 

banded  matrix,  675 
block,  546 
constraints,  132 
equality  constraints,  523,  542 
variables,  672 
ellipsoid,  29,  39,  635 

condition  number,  461 
covering,  275 
degenerate,  30 
intersection,  262 
Löwner-John, 410
maximum  volume,  414 
minimum  volume,  410 
separation,  197 
via  analytic  center,  420 
volume,  407 

embedded  optimization,  3 
entropy,  72,  90,  117 

maximization,  537,  558,  560,  562 
dual  function,  222 
self-concordance,  497 
epigraph,  75 

problem,  134 
equality 

constrained  minimization,  521 
constraint,  127 
convex,  191 

elimination,  132,  523,  542 
equations 

KKT,  243 
normal,  458 
equivalent 

norms,  636 
problems,  130 
estimation,  292 
Bayesian,  357 
covariance,  355 


least-squares,  177 
linear  measurements,  352 
maximum  a  posteriori,  357 
noise  free,  303 

nonparametric  distribution,  359 
statistical,  351 
Euclidean 
ball,  29 
distance 
matrix,  65 
problems,  405 
norm,  633 

projection  via  pseudo-inverse,  649 
exact  line  search,  464 
exchange  rate,  184 
expanded  set,  61 
experiment  design,  384 
A-optimal,  387 
D-optimal,  387 
dual,  276 
E-optimal, 387
explanatory  variables,  353 
explicit  constraint,  134 
exponential,  71 

distribution,  105 
matrix,  110 

extended-value  extension,  68 

extrapolation,  333 

extremal  volume  ellipsoids,  410 

facility  location  problem,  432 
factor-solve  method,  666 
factorization 

block  LU,  673 
Cholesky,  118,  546,  669 
$LDL^T$, 671
LU,  668 
QR,  682 
symbolic,  511 
Farkas  lemma,  263 
fastest  mixing  Markov  chain,  173 
dual,  286 
feasibility 

methods,  579 
problem,  128 
feasible,  127 

descent  direction,  527 
dual,  216 
point,  127 
problem,  127 
set,  127 

Fenchel’s  inequality,  94 
first-order 

approximation,  640 
condition 

convexity,  69 
monotonicity,  109 
quasiconvexity, 99, 121

fitting 

minimum  norm,  331 
polynomial,  331 
spline,  331 
floor  planning,  438 

geometric  program,  444 
flop  count,  662 
flow 

optimal,  193,  550,  619 
forward  substitution,  665 
fractional  program 
generalized,  205 
Frobenius  norm,  634 
scaling,  163,  478 
fuel  use  map,  194,  213 
function 

affine,  36 
barrier,  563 

closed,  458,  529,  577,  639 
composition,  83 
concave,  67 
conjugate,  90,  221 
continuous,  639 
convex,  67 
convex  hull,  119 
convex-concave,  115 
derivative,  640 
differentiable,  640 
domain,  639 
dual,  216 

elementary  symmetric,  122 
extended-value  extension,  68 
first-order  approximation,  640 
fitting,  324 
gradient,  641 
Huber,  345 

interpolation,  324,  329 
Lagrange  dual,  216 
Lagrangian,  215 
Legendre  transform,  95 
likelihood,  351 
linear-fractional,  41 
log  barrier,  563 
log-concave,  104 
log-convex,  104 
matrix  monotone,  108 
monomial,  160 
monotone,  115 
notation,  14,  639 
objective,  127 
penalty,  294 
perspective,  39,  89,  117 
piecewise-linear,  119,  326 
pointwise  maximum,  80 
posynomial,  160 


projection,  397 
projective,  41 
quasiconvex,  95 
quasilinear,  122 
self-concordant,  497 
separable,  249 
support,  63 
unimodal,  95 
utility,  115,  211,  339 

game,  238 

advantage  of  going  second,  240 
barrier  method,  627 
bounded  inverse  condition,  559 
continuous,  239 
convex-concave,  540,  542,  560 
Newton  step,  559 
duality,  231 

duality  interpretation,  239 
matrix,  230 
gamma  function,  104 
log-convexity,  123 
Gauss-Newton  method,  520 
Gaussian  distribution 

log-concavity,  104,  123 
generalized 

eigenvalue 

decomposition,  647 
minimization,  204 
quasiconvexity,  102 
fractional  program,  205 
geometric  program,  200 
inequality,  43 

barrier  method,  596,  601 
central  path,  598 
dual,  53,  264 
log  barrier,  598 
logarithm,  597 
optimization  problem,  167 
theorem  of  alternatives,  269 
linear-fractional  program,  152 
logarithm,  597 
dual,  607 

positive  semidefinite  cone,  598 
second-order  cone,  597 
posynomial,  200 
geometric 

mean,  73,  75 
conjugate,  120 
maximizing,  198 
program,  160,  199 
barrier  method,  573 
convex  form,  162 
dual,  256 

floor  planning,  444 
sensitivity  analysis,  284 
unconstrained,  254,  458 
global optimization, 10
bounds,  11 

GP,  see  geometric  program 
gradient,  641 

conjugate,  121 
log  barrier,  564 
method,  466 

and  condition  number,  473 
projected,  557 
Gram  matrix,  405 

halfspace,  27 

Voronoi  description,  60 
Hankel  matrix,  65,  66,  170,  204 
harmonic  mean,  116,  198 
log-concavity,  122 
Hessian,  71,  643 
conjugate,  121 
Lipschitz  continuity,  488 
log  barrier,  564 
sparse,  511 

Hölder's inequality, 78

Huber  penalty  function,  190,  299,  345 

hull 

affine,  23 
conic,  25 
convex,  24 
hybrid  vehicle,  212 
hyperbolic 
cone,  39 
constraint,  197 
set,  61 

hyperplane,  27 

separating,  46,  195,  423 
supporting,  50 
hypothesis  testing,  364,  370 

IID  noise,  352 
implementation 

equality  constrained  methods,  542 
interior-point  methods,  615 
line  search,  508 
Newton’s  method,  509 
unconstrained  methods,  508 
implicit  constraint,  134 
Lagrange  dual,  257 
indicator  function,  68,  92 

linear  approximation,  218 
projection  and  separation,  401 
induced  norm,  636 
inequality 

arithmetic-geometric  mean,  75,  78 
Cauchy-Schwarz, 633
Chebyshev,  150,  154 
componentwise,  32,  43 
constraint,  127 
Fenchel’s,  94 


form  linear  program,  147 
dual,  225 
generalized,  43 
Hölder's, 78
information,  115 
Jensen’s,  77 
matrix,  43,  647 
triangle,  634 
Young’s,  94,  120 
inexact  line  search,  464 
infeasibility  certificate,  259 
infeasible 

barrier  method,  571 
Newton  method,  531,  534,  558 
convergence  analysis,  536 
phase  I,  582 
problem,  127 

weak  duality,  273 
infimum,  638 

information  inequality,  115 
inner  product,  633 
input  design,  307 
interior,  637 
relative,  23 
interior-point 

method,  561 

implementation,  615 
primal-dual  method,  609 
internal  rate  of  return,  97 
interpolation,  324,  329 
least-norm,  333 
with  convex  function,  337 
intersection 

ellipsoids,  262 
sets,  36 

int  (interior),  637 
inverse 

convex  function,  114 
linear-fractional  function,  62 
investment 

log-optimal,  559 
return,  208 

IRR  (internal  rate  of  return),  97 

Jacobian,  640 
Jensen’s  inequality,  77 

quasiconvex  function,  98 

Karush-Kuhn-Tucker, see KKT
kinematic  constraints,  247 
KKT 

conditions,  243 
central  path,  567 
generalized  inequalities,  267 
mechanics  interpretation,  246 
modified,  577 
supporting hyperplane interpretation, 283

matrix,  522 

bounded  inverse  assumption,  530 
nonsingularity,  523,  547 
system,  677 

nonsingularity,  557 
solving,  542 

Kullback-Leibler  divergence,  90,  115,  362 
$\ell_1$-norm

approximation,  294,  353,  514 
barrier  method,  617 
regularization,  308 
steepest  descent  method,  477 
Lagrange 

basis,  326 
dual  function,  216 
dual  problem,  223 
multiplier,  215 

contact  force  interpretation,  247 
price  interpretation,  253 
Lagrangian,  215 

relaxation,  276,  654 
LAPACK,  684 
Laplace  transform,  106 
Laplacian  distribution,  352 
$LDL^T$ factorization, 671
least-norm 

interpolation,  333 
problem,  131,  302 
least-penalty  problem,  304 

statistical  interpretation,  359 
least-squares,  4,  131,  153,  177,  293,  304,  458 
convex  function  fit,  338 
cost  as  function  of  weights,  81 
dual  function,  218 
regularized,  184,  205 
robust,  190,  300,  323 
strong  duality,  227 
Legendre  transform,  95 
length,  96,  634 
level  set 

convex  function,  113 
lexicographic  cone,  64 
likelihood  function,  351 
likelihood  ratio  test,  371 
line,  21 

search,  464,  514 
backtracking,  464 
exact,  464 

implementation,  508 
pre-computation,  518 
primal-dual  interior-point  method,  612 
segment,  21 
linear 

classification,  423 


convergence,  467 
discrimination,  423 
equality  constraint 
eliminating,  132 
equations 
banded,  669 
block  elimination,  672 
easy,  664 

factor-solve  method,  666 
KKT  system,  677 
LAPACK,  684 
least-squares,  304 
low  rank  update,  680 
lower  triangular,  664 
multiple  right  hand  sides,  667 
Newton  system,  510 
orthogonal,  666 
Schur  complement,  672 
software,  684 
solution  set,  22 
solving,  661 
sparse  solution,  304 
symmetric  positive  definite,  669 
underdetermined,  681 
upper  triangular,  665 
estimation,  292 
best  unbiased,  176 
facility  location,  432 
inequalities 

alternative,  261 
analytic  center,  458 
log-barrier,  499 
solution  set,  27,  31 
theorem  of  alternatives,  50,  54 
matrix  inequality,  38,  76,  82 
alternative,  270 

analytic  center,  422,  459,  508,  553 
multiple,  169 
strong  alternatives,  287 
program,  1,  6,  146 

barrier  method,  571,  574 
Boolean,  194,  276 
central  path,  565 
dual,  224,  274 
dual  function,  219 
inequality  form,  147 
primal-dual  interior-point  method,  613 
random  constraints,  157 
random  cost,  154 
relaxation  of  Boolean,  194 
robust,  157,  193,  278 
standard  form,  146 
strong  duality,  227,  280 
separation 
ellipsoids,  197 
linear-fractional
function, 41

composition,  102 
image  of  convex  set,  62 
inverse,  62 
quasiconvexity,  97 
program,  151 
generalized,  152 

linearized  optimality  condition,  485 
LMI,  see  linear  matrix  inequality 
locally  optimal,  9,  128,  138 
location,  432 
log  barrier,  563 

generalized  inequalities,  597,  598 
gradient  and  Hessian,  564 
linear  inequalities,  499 
linear  matrix  inequality,  459 
penalty  function,  295 
log-Chebyshev  approximation,  344,  629 
log-concave 

density,  104,  352 
function,  104 
log-convex  function,  104 
log-convexity 

Perron-Frobenius  eigenvalue,  200 
second-order  conditions,  105 
log-determinant,  499 
function,  73 
gradient,  641 
Hessian,  644 

log-likelihood  function,  352 
log-optimal  investment,  209,  559 
log-sum-exp 

function,  72,  93 
gradient,  642 
logarithm,  71 
dual,  607 

generalized  inequality,  597 
self-concordance,  497 
logistic 

classification,  427 
function,  122 
model,  210 
regression,  354 
Lorentz  cone,  31 
low  rank  update,  680 
lower  triangular  matrix,  664 
Löwner-John ellipsoid, 410
LP, see linear program
$\ell_p$-norm, 635
dual,  637 

LU  factorization,  668 

manufacturing  yield,  211 

MAP,  see  maximum  a  posteriori  probability 

Markov  chain 

equilibrium  distribution,  285 
estimation,  394 


fastest  mixing,  173 
dual,  286 

Markowitz  portfolio  optimization,  155 
matrix 

arrow,  670 

banded,  510,  546,  553,  669,  675 
block  inverse,  650 
completion  problem,  204 
condition  number,  649 
convexity,  110,  112 
copositive,  65,  202 
detection  probabilities,  366 
diagonal  plus  low  rank,  511,  678 
Euclidean  distance,  65 
exponential,  110 
factorization,  666 
fractional  function,  76,  82,  89 
fractional  minimization,  198 
game,  230 
Gram,  405 

Hankel,  65,  66,  170,  204 
Hessian,  643 
inequality,  43,  647 
inverse 

matrix  convexity,  124 
inversion  lemma,  515,  678 
KKT,  522 

nonsingularity,  557 
low  rank  update,  680 
minimal  upper  bound,  180 
monotone  function,  108 
multiplication,  663 
node  incidence,  551 
nonnegative,  165 
nonnegative  definite,  647 
norm,  82 

approximation,  194 
minimization,  169 
orthogonal,  666 

$P_0$, 202

permutation,  666 
positive  definite,  647 
positive semidefinite, 647
power,  110,  112 
pseudo-inverse,  649 
quadratic  function,  111 
sparse,  511 
square-root,  647 
max  function,  72 
conjugate,  120 
max-min 

inequality,  238 
property,  115,  237 
max-row-sum  norm,  194,  636 
maximal  element,  45 
maximization  problem,  129 
concave, 137
maximum 

a  posteriori  probability  estimation,  357 
determinant  matrix  completion,  204 
eigenvalue,  82,  203 
element,  45 
entropy,  254,  558 
distribution,  362 
dual,  248 

strong  duality,  228 
likelihood 
detector,  369 
estimation,  351 
probability  distance,  118 
singular  value,  82,  649 
dual,  637 
minimization,  169 
norm,  636 
volume 

ellipsoid,  414 
rectangle,  449,  629 

mean 

harmonic,  116 
method 

analytic  centers,  626 
barrier,  568 
bisection,  146 
descent,  463 
factor-solve,  666 
feasibility,  579 
Gauss-Newton,  520 
infeasible  start  Newton,  534 
interior-point,  561 
local  optimization,  9 
Newton’s,  484 
phase  I,  579 
primal-dual,  609 
randomized,  11 
sequential unconstrained minimization, 569

steepest  descent,  475 
midpoint  convexity,  60 
minimal 

element,  45 

via  dual  inequalities,  54 
surface,  159 
minimax 

angle  fitting,  448 
approximation,  293 
detector,  367 
minimization 

equality  constrained,  521 
minimizing 

sequence,  457 
minimum 

element,  45 


via  dual  inequalities,  54 
fuel  optimal  control,  194 
length  piecewise-linear  curve,  547 
norm 

fitting,  331 
singular  value,  649 

variance  linear  unbiased  estimator,  176 
volume  ellipsoid 
dual,  222,  228 
Minkowski  function,  119 
mixed  strategy  matrix  game,  230 
ML,  see  maximum  likelihood 
model  predictive  control,  17 
moment,  66 

bounds,  170 
function 

log-concavity,  123 
generating  function,  106 
multidimensional,  204 
monomial,  160 

approximation,  199 
monotone 

mapping,  115 
nonnegative  cone,  64 
vector  function,  108 
monotonicity 

first-order condition, 109
Moore-Penrose  inverse,  649 
Motzkin’s  theorem,  447 
multicriterion 

detector  design,  368 
optimization,  181 
problem,  181 

scalarization,  183 
multidimensional  moments,  204 
multiplier,  215 
mutual  information,  207 


$\mathcal{N}$ (nullspace), 646
network

optimal  flow,  193,  550 
rate  optimization,  619,  628 
Newton 

decrement,  486,  515,  527 
infeasible  start  method,  531 
method,  484 

affine  invariance,  494,  496 
approximate,  519 
convergence  analysis,  529,  536 
convex-concave  game,  540 
dual,  557 

equality  constraints,  525,  528 
implementing,  509 
infeasible,  558 
self-concordance,  531 
trust  region,  515 
step 
affine invariance, 527
equality  constraints,  526 
primal-dual,  532 
system,  510 

Neyman-Pearson  lemma,  371 
node  incidence  matrix,  551 
nonconvex 

optimization,  9 
quadratic  problem 
strong  duality,  229 
nonlinear 

classification,  429 
facility  location  problem,  434 
optimization,  9 
programming,  9 
nonnegative 

definite  matrix,  647 
matrix,  165 
orthant,  32,  43 
minimization,  142 
polynomial,  44,  65 

nonparametric  distribution  estimation,  359 
norm,  72,  93,  634 

approximation,  291 
by  quadratic,  636 
dual,  254 
dual  function,  221 
weighted,  293 
ball,  30 
cone,  31 
dual,  52 
conjugate,  93 
dual,  637 
equivalence,  636 
Euclidean,  633 
Frobenius,  634 
induced,  636 
matrix,  82 
max-row-sum,  636 
maximum  singular  value,  636 
operator,  636 
quadratic,  635 

approximation,  413 
spectral,  636 
sum-absolute-value,  635 
normal 

cone,  66 
distribution 

log-concavity,  104 
equations,  458,  510 
vector,  27 

normalized  entropy,  90 
nuclear  norm,  637 
nullspace,  646 

objective  function,  127 
open  set,  637 


operator  norm,  636 
optimal 

activity  levels,  195 
allocation,  523 
consumption,  208 
control,  194,  303,  552 
hybrid  vehicle,  212 
minimum  fuel,  194 
design,  292,  303 
detector  design,  364 
duality  gap,  226 
input  design,  307 
Lagrange  multipliers,  223 
locally,  9 
network  flow,  550 
Pareto,  57 
point,  128 
local,  138 

resource  allocation,  559 
set,  128 

trade-off  analysis,  182 
value,  127,  175 

bound  via  dual  function,  216 
optimality 

conditions,  241 

generalized  inequalities,  266 
KKT,  243 
linearized,  485,  526 
optimization 
convex,  7 
embedded,  3 
global,  10 
local,  9 

multicriterion,  181 
nonlinear,  9 
over  polynomials,  203 
problem,  127 

epigraph  form,  134 
equivalent,  130 
feasibility,  128 
feasible,  127 

generalized  inequalities,  167 
maximization,  129 
optimal  value,  127 
perturbation  analysis,  249,  250 
sensitivity  analysis,  250 
standard  form,  127 
symmetry,  189 
recourse,  211,  519 
robust,  208 
two-stage,  211,  519 
variable,  127 
vector  objective,  174 
optimizing 

over  some  variables,  133 
option  pricing,  285 
oracle problem description, 136
ordering 

lexicographic,  64 
orthogonal 

complement,  27 
decomposition,  646 
matrix,  666 
outliers,  298 

outward  normal  vector,  27 
over-complete  basis,  333 

parameter  problem  description,  136 
parametric  distribution  estimation,  351 
Pareto  optimal,  57,  177,  206 
partial 

ordering  via  cone,  43 
sum,  62 

partitioning  problem,  219,  629 
dual,  226 
dual  function,  220 
eigenvalue  bound,  220 
semidefinite  program  relaxation,  285 
pattern  recognition,  422 
penalty  function 

approximation,  294 
deadzone-linear,  295 
Huber,  299 
log  barrier,  295 
robust,  299,  343 
statistical  interpretation,  353 
permutation  matrix,  666 
Perron-Frobenius  eigenvalue,  165 
log-convexity,  200 
perspective,  39,  89,  117 
conjugate,  120 
function,  207 
image  of  polyhedron,  62 
perturbed  optimization  problem,  250 
phase  I  method,  579 
complexity,  592 
infeasible  start,  582 
sum  of  infeasibilities,  580 
piecewise 
arc,  453 

polynomial,  327 
piecewise-linear 
curve 

minimum  length,  547 
function,  80,  119,  326 
conjugate,  120 
minimization,  150,  562 
dual,  275 

pin-hole  camera,  39 
placement,  432 

quadratic,  434 
point 

minimal,  45 


minimum,  45 
pointed  cone,  43 
pointwise  maximum,  80 
Poisson  distribution,  353 
polyhedral  uncertainty 

robust  linear  program,  278 
polyhedron,  31,  38 

Chebyshev  center,  148,  417 
convex  hull  description,  34 
distance  between,  154,  403 
Euclidean  projection  on,  398 
image  under  perspective,  62 
volume,  108 
Voronoi  description,  60 
polynomial 

classification,  430 
fitting,  326,  331 
interpolation,  326 
log-concavity,  123 
nonnegative,  44,  65,  203 
piecewise,  327 
positive  semidefinite,  203 
sum  of  squares,  203 
trigonometric,  116,  326 
polytope,  31 
portfolio 

bounding  risk,  171 
diversification  constraint,  279 
log-optimal,  209 
loss  risk  constraints,  158 
optimization,  2,  155 
risk-return  trade-off,  185 
positive 

definite  matrix,  647 
semidefinite 

cone,  34,  36,  64 
matrix,  647 

matrix  completion,  204 
polynomial,  203 
posynomial,  160 

generalized,  200 
two-term,  200 
power  allocation,  196 

broadcast  channel,  210 
communication  channel,  210 
hybrid  vehicle,  212 
power  function,  71 
conjugate,  120 
log-concavity,  104 

pre-computation  for  line  search,  518 
predictor-corrector  method,  625 
preference  relation,  340 
present  value,  97 
price,  57 

arbitrage-free,  263 
interpretation  of  duality,  240 
option, 285
shadow,  241 
primal  residual,  532 
primal-dual 

method,  609 

geometric  program,  613 
linear  program,  613 
Newton  step,  532 
search  direction,  609 
probability 

conditional,  42 
distribution 
convex  sets,  62 
maximum  distance,  118 
simplex,  33 
problem 

conic  form,  168 
control,  303 
convex,  136 
data,  136 
dual,  223 

equality  constrained,  521 
estimation,  292 

Euclidean  distance  and  angle,  405 
floor  planning,  438 
Lagrange  dual,  223 
least-norm,  302 
least-penalty,  304 
location,  432 
matrix  completion,  204 
maximization,  129 
multicriterion,  181 
norm  approximation,  291 
optimal  design,  292,  303 
partitioning,  629 
placement,  432 
quasiconvex,  137 
regression,  291 
regressor  selection,  310 
unbounded  below,  128 
unconstrained,  457 
unconstrained  quadratic,  458 
product 

convex  functions,  119 
inner,  633 

production  frontier,  57 
program 

geometric,  160 
linear,  146 
quadratic,  152 

quadratically  constrained  quadratic,  152 
semidefinite,  168,  201 
projected  gradient  method,  557 
projection 

coordinate,  38 
Euclidean,  649 


function,  397 

indicator  and  support  function,  401 
on  affine  set,  304 
on  set,  397 
on  subspace,  292 
projective  function,  41 
proper  cone,  43 

PSD  (positive  semidefinite),  203 
pseudo-inverse,  88,  141,  153,  177,  185,  305, 
649 

QCQP  (quadratically  constrained  quadratic 
program),  152 

QP  (quadratic  program),  152 
QR  factorization,  682 
quadratic 

convergence,  489,  539 
discrimination,  429 
function 

convexity,  71 
gradient,  641 
Hessian,  644 
minimizing,  140,  514 
inequalities 

analytic  center,  519 
inequality 

solution  set,  61 
matrix  function,  111 
minimization,  458,  649 
equality  constraints,  522 
norm,  635 

approximation,  636 
norm  approximation,  413 
optimization,  152,  196 
placement,  434 
problem 

strong  duality,  229 
program,  152 

primal-dual  interior-point  method,  630 
robust,  198 
smoothing,  312 

quadratic-over-linear  function,  72,  76 
minimizing,  514 

quadratically  constrained  quadratic  program, 
152,  196 

strong  duality,  227 
quartile,  62,  117 
quasi-Newton  methods,  496 
quasiconvex 

function,  95 

convex  representation,  103 
first-order  conditions,  99,  121 
Jensen’s  inequality,  98 
second-order  conditions,  101 
optimization,  137 

via  convex  feasibility,  145 
quasilinear  function,  122 
ℛ  (range),  645
R  (reals),  14 

R+  (nonnegative  reals),  14 
R++  (positive  reals),  14 
R^n_+  (nonnegative  orthant),  32
randomized 

algorithm,  11 
detector,  365,  395 
strategy,  230 
range,  645 
rank,  645 

quasiconcavity,  98 
ratio  of  distances,  97 
recession  cone,  66 
reconstruction,  310 
recourse,  211,  519 
rectangle,  61 

maximum  volume,  449,  629 
redundant  constraint,  128 
regression,  153,  291 
logistic,  354 
robust,  299 
regressor,  291 

selection,  310,  334 
regularization,  5 
ℓ1,  308

smoothing,  307 
Tikhonov,  306 
regularized 

approximation,  305 
least-squares,  184,  205 
relative 

entropy,  90 
interior,  23 

positioning  constraint,  439 
residual,  291 

amplitude  distribution,  296 
dual,  532 
primal,  532 

resource  allocation,  559 
restricted  set,  61 
Riccati  recursion,  553 
Riesz-Fejer  theorem,  348 
risk-return  trade-off,  185 
risk-sensitive  cost,  155 
robust 

approximation,  318 
Chebyshev  approximation,  323 
detector,  372 

least-squares,  190,  300,  323 
linear  discrimination,  424 
linear  program,  157,  193,  278 
optimization,  208 
penalty  function,  299,  343 
quadratic  program,  198 
regression,  299 


S^n  (symmetric  n  ×  n  matrices),  34
standard  inner  product,  633
S^n_+  (positive  semidefinite  n  ×  n  matrices),  34

saddle-point,  115 

convex-concave  function,  281 
duality  interpretation,  237 
via  Newton’s  method,  627 
scalarization,  178,  206,  306,  368 
duality  interpretation,  236 
multicriterion  problem,  183 
scaling,  38 

Schur  complement,  76,  88,  124,  133,  546, 
650,  672 

SDP,  see  semidefinite  program 
search  direction,  463 
Newton,  484,  525 
primal-dual,  609 
second  derivative,  643 
chain  rule,  645 
second-order 
conditions 
convexity,  71 
log-convexity,  105 
quasiconvexity,  101 
cone,  31,  449 

generalized  logarithm,  597 
cone  program,  156 
barrier  method,  601 
central  path,  599 
complexity,  606 
dual,  287 
segment,  21 

self-concordance,  496,  516 

barrier  method  complexity,  585 
composition,  499 
conjugate  function,  517 
Newton  method  with  equality  constraints, 
531 

semidefinite  program,  168,  201 
barrier  method,  602,  618 
central  path,  600 
complex,  202 
complexity,  608 
dual,  265 
relaxation 

partitioning  problem,  285 
sensitivity  analysis,  250 

geometric  program,  284 
separable 

block,  552 
function,  249 
separating 

affine  and  convex  set,  49 
cones,  66 

convex  sets,  403,  422 
hyperplane,  46,  195,  423
converse  theorem,  50 
duality  proof,  235 
polyhedra,  278 
theorem  proof,  46 
point  and  convex  set,  49,  399 
point  and  polyhedron,  401 
sphere,  195 
strictly,  49 
set 

affine,  21 
boundary,  638 
closed,  637 
closure,  638 
condition  number,  461 
convex,  23 

distance  between,  402 
distance  to,  397 
eccentricity,  461 
expanded,  61 
hyperbolic,  61 
intersection,  36 
open,  637 
projection,  397 
rectangle,  61 
restricted,  61 
slab,  61 
sublevel,  75 
sum,  38 
superlevel,  75 
wedge,  61 
width,  461 

shadow  price,  241,  253 
signomial,  200 
simplex,  32 

probability,  33 
unit,  33 
volume,  407 
singular  value,  82 

decomposition,  648 
slab,  61 

slack  variable,  131 
Slater’s  condition,  226 

generalized  inequalities,  265 
proof  of  strong  duality,  234 
smoothing,  307,  310 
quadratic,  312 

SOCP,  see  second-order  cone  program 
solid  cone,  43 
solution  set 

linear  equations,  22 
linear  inequality,  27 
linear  matrix  inequality,  38 
quadratic  inequality,  61 
strict  linear  inequalities,  63 
SOS  (sum  of  squares),  203 


sparse 

approximation,  333 
description,  334 
matrix,  511 

Cholesky  factorization,  670 
LU  factorization,  669 
solution,  304 
vectors,  663 
spectral 

decomposition,  646 
norm,  636 
dual,  637 
minimization,  169 

sphere 

separating,  195 
spline,  327 

fitting,  331 

spread  of  eigenvalues,  203 
square-root  of  matrix,  647 
standard  form 

cone  program,  168 
dual,  266 

linear  program,  146 
dual,  224 

standard  inner  product,  633 
S^n,  633

statistical  estimation,  351 
steepest  descent  method,  475 
ℓ1-norm,  477
step  length,  463 

stopping  criterion  via  duality,  242 
strict 

linear  inequalities,  63 
separation,  49 
strong 

alternatives,  260 
convexity,  459,  558 
duality,  226 

linear  program,  280 
max-min  property,  238 

convex-concave  function,  281
sublevel  set,  75 

closedness  assumption,  457 
condition  number,  461 
suboptimality 

certificate,  241 
condition,  460 
substitution  of  variable,  130 
sum 

of  k  largest,  80 
conjugate,  120 
solving  via  dual,  278 
of  squares,  203 
partial,  62 
sets,  38 

sum-absolute-value  norm,  635 
SUMT  (sequential  unconstrained  minimization  method),  569
superlevel  set,  75 
support  function,  63,  81,  92,  120 
projection  and  separation,  401 
support  vector  classifier,  425 
supporting  hyperplane,  50 
converse  theorem,  63 
KKT  conditions,  283 
theorem,  51 
supremum,  638 
surface 

area,  159 

optimal  trade-off,  182 
surrogate  duality  gap,  612 
SVD  (singular  value  decomposition),  648 
symbolic  factorization,  511 
symmetry,  189 

constraint,  442 

theorem 

alternatives,  50,  54,  258 
generalized  inequalities,  269 
eigenvalue  interlacing,  122 
Gauss-Markov,  188 
Motzkin,  447 
Perron-Frobenius,  165 
Riesz-Fejer,  348 
separating  hyperplane,  46 
Slater,  226 

supporting  hyperplane,  51 
Tikhonov  regularization,  306 
time-frequency  analysis,  334 
total  variation  reconstruction,  312 
trade-off  analysis,  182 
transaction  fee,  155 
translation,  38 
triangle  inequality,  634 
triangularization,  326 
trigonometric  polynomial,  116,  326 
trust  region,  302 

Newton  method,  515 
problem,  229 

two-stage  optimization,  519 
two-way  partitioning  problem,  see  partitioning  problem


unbounded  below,  128
uncertainty  ellipsoid,  322
unconstrained  minimization,  457
method,  568

underdetermined  linear  equations,  681
uniform  distribution,  105
unimodal  function,  95
unit

ball,  634
simplex,  33

upper  triangular  matrix,  665
utility  function,  115,  130,  211,  339

variable

change  of,  130
dual,  215
elimination,  672
explanatory,  353
optimization,  127
slack,  131
vector

normal,  27
optimization,  174
scalarization,  178
verification,  10
volume

ellipsoid,  407
polyhedron,  108
simplex,  407

Von  Neumann  growth  problem,  152
Voronoi  region,  60

water-filling  method,  245
weak

alternatives,  258
duality,  225

infeasible  problems,  273
max-min  inequality,  281
wedge,  61
weight  vector,  179
weighted

least-squares,  5
norm  approximation,  293
well  conditioned  basis,  407
width,  461

wireless  communication  system,  196
Wishart  distribution,  105
worst-case

analysis,  10

robust  approximation,  319

yield  function,  107,  211
Young’s  inequality,  94,  120

Z  (integers),  697


